In my previous blog post I praised the open-source database OrientDB. I've received many emails from people since then, asking me questions about OrientDB or telling me that they'd love to try it out but don't know where to start.

I thought it was time for me to contribute to the OrientDB community and write a short tutorial for exactly this audience. Ideally, I'd like to ensure that if you sit through the videos below, you'll at least have an informed view of whether OrientDB is of interest to you, your project and your organization.

Think of this blog post as the first installment. I may come back with some more advanced tutorials later. This tutorial shows you how to:
  • Download and install Orient DB
  • Play with the Orient DB studio and perform queries on the Grateful Dead database provided with the community edition
  • Create your own database (again using Orient DB studio, perhaps later I’ll provide a tutorial to show how to do this from Node or Java)
  • Create server-side functions
  • Perform graph- and document-based queries over Orient DB
I hope this tutorial gives you a good idea of how Orient DB works and some of its power. I also hope it will help illustrate the reasons for my enthusiasm in the previous post where I argued the case for Orient DB. In this post I’ll focus on how you get started.

Introduction (or, Read Me First!)

I had some of my guys review the videos for this tutorial, and they pointed out that I assumed the viewer/reader knew something about graphs. That may be a bad assumption, but I don't want to patronize you either.

This is my attempt at giving you a short introduction to the concepts of graphs and the vocabulary I'll use in the tutorial.

Orient DB is not only a graph database. It supports all (or nearly all) the features of a document database and an object-oriented database. Hence, discussing only graph theory may be a bit misleading. However, since I’m using primarily the Graph API of Orient DB in this tutorial, it may be prudent of me to ensure you know something about graphs.
Here is what you need to know.
The above picture is taken from the Wikipedia article on graph databases. The illustration shows 3 objects (as circles) that we keep track of. There are also lines (or arrows) that define relationships between the objects. We call an object a vertex and a relationship an edge.

I do have one issue with that graph: it always shows two edges that differ only in the direction they are read. I would have preferred a single line (a single edge). However, since it is likely that you would look up graph databases on Wikipedia, I thought it would be better to show their example and criticize it. (OK, I will go and suggest another drawing for Wikipedia. It is on my to-do list. Honestly!)

A graph (for the purpose of this tutorial) consists of:
  • Vertices. A vertex is a cluster of information. The information is typically stored in key-value pairs (aka properties). You may think of it as a map (I've noticed that many developers call these maps hash tables, but I find this misleading, as hashing is a particular strategy for arranging keys in one implementation of a map).
    One way of seeing a vertex is as an identifiable set of properties, where a property is defined as a key-value pair. Unique to Orient DB, the properties can be nested to form complex data structures. I will not do any nesting in this tutorial. I do, however, plan to write another tutorial where I explore Orient DB as a document database.
  • Edges. An edge is a link between two vertices. Edges may also have properties. Edges have direction; that is, they start in one vertex and end in another vertex (to be precise, they may also end in the same vertex they started in; such an edge has the cute name 'buckle' in graph theory, more commonly called a loop).
That’s it! Really.

Most of you probably have some background in relational databases, so let me give you a comparison between OrientDB and a typical relational database. The comparison will be slightly misleading no matter how I present it, so to minimize the confusion, I also discuss how the two technologies differ.
  • Vertex vs. Row. A row in a relational database is a flat set of properties. In OrientDB a vertex may be an arbitrarily complex structure; if you are familiar with document databases, a vertex is basically a document.
  • Edge vs. Relationship. A relationship in a relational database is not a first-class citizen; it is made ad hoc by use of joins on keys. In OrientDB (as in all graph databases) the edge (relationship) is a first-class citizen. This means it can have an id, properties, etc.
You may ask, where is the comparison of schema constructs such as tables, columns, triggers, stored procedures, etc.? The tutorial does touch on these constructs, so I'll defer that discussion until later.

In the second tutorial, I will repeat this short theory session. To help you progress from this incomplete theory session (and for your reading pleasure), I've provided a few links below.
I decided to use OrientDB Studio in the videos. It is probably the least likely tool that you'll use to access the database; most developers' primary exposure to OrientDB is through language bindings or libraries. I was thinking of selecting one of these environments and showing how OrientDB is used there. The problem is that if I picked one environment, I would risk alienating the ones using the other environments. I think that no matter what environment you use, you will at some point bring up OrientDB Studio, but perhaps I'm wrong and I just alienated everyone. I hope not!
If time permits, I’ll come back with separate tutorials for how to use Orient DB from Java/Scala/Python/Node.js/...

Part 1: Download and Install Orient DB

I've prepared a simple video that explains how to download and install OrientDB. I'm on a Mac, hence the video shows how to install it on Macs. I do, however, think it should be easy to transpose the steps to Windows or Linux.

At the time I created this video, the latest version of OrientDB was 1.6.2 (things are moving rapidly at OrientDB, and I noticed they had already released 1.6.3 before I got a chance to publish this post). OrientDB releases many versions per year, so there is a risk that by the time you read this tutorial, my instructions are slightly out of date. All the steps that I follow in the video can also be found on the Orient Technologies website, and I'm sure they will be kept up to date. Hence, if the version you downloaded is different from 1.6.2, you may be better off following the up-to-date instructions on the OrientDB site.

A word of warning: towards the end I execute a few queries to ensure that everything works. Don't despair if you don't understand what I'm doing; all I wanted to do was make sure that I'd succeeded in installing OrientDB.
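If you would rather skip the video, the installation boils down to something like this (a sketch assuming the community-edition zip from the OrientDB download page; adjust the version number to whatever you downloaded):

$ unzip orientdb-community-1.6.2.zip
$ cd orientdb-community-1.6.2/bin
$ ./server.sh

With the server running, OrientDB Studio should be available in your browser at http://localhost:2480 (the default port).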

 

Part 2: Play Around with the Grateful Dead Graph

OrientDB comes with an example database instance. The provided database contains a graph describing performances by the Grateful Dead. The graph is sparse and the information model is quite simple, which is unfortunate, as it is hard to illustrate all the power of OrientDB with this graph. Perhaps at a later time I'll try to find an open-source dataset with more interesting information (suggestions welcome!).
The Grateful Dead graph can be illustrated with the following UML model:
[UML model of the Grateful Dead graph]
The model above is as truthful as I could make it. I did, however, exercise some artistic 'enhancements' to make the model easier to read. Here are my embellishments:
  • Notice the attribute on the 'followed_by' association. Various versions of UML have provided different ways of depicting attributes on relationships, so let me explain what I mean by my way of using UML. The example database stores a count of the number of times one song has been followed by another song in a concert (I did not create the model, so I'm providing my own interpretation); the weight property is that count. To illustrate this in UML, I have to show properties on associations.
    Conceptually, this is not a problem for a graph database, because it can store properties on edges. It may, however, be awkward when mapping to other technologies.
  • I decided to expand the enumerated type specifying the 'song_type'. A more common way to model this would be to use an enum type. However, I think that, display-wise, expanding the enum is better here, as it is only used in one place.
  • Another 'cheat' is the introduction of classes. The example database does not actually use custom classes (all the objects are instances of a class called 'V', short for vertex). It does, however, provide a property called 'type' that is used for typing. I believe the model I created better illustrates the original intent of the model.
I would assume only some of my readers are familiar with the Grateful Dead, so let me give you a few sentences about them. The Grateful Dead are a cult rock band that started playing in California in the '60s. I assume the graph is populated from the same data set that produced this illustration:
 
It is really not important who the Grateful Dead were or whether you like them. Just think of them as a rock group that performed a set of songs very many times. Someone cared enough to capture which songs they performed, which song followed another song, who wrote each song, and who sang it.

With that introduction, let’s move on to the video where I create a set of queries and navigate around the example database.
Just for your reference, here are some of the queries I used:
  • Select all vertices (or object) in the database
    • select * from V
  • Select the vertex with the id #9:8
    • select * from V where @rid=#9:8
    • This could also be written as:
      • select * from #9:8 
  • Select all the artists
    • select * from V where type = 'artist'
  • Select all the songs that have been performed more than 10 times
    • select * from V where type = 'song' and performances > 10
  • Count all songs
    • select count(*) from V where type='song'
  • Count all artists
    • select count(*) from V where type = 'artist'
  • Find all songs sung by the artist with the id #9:8 (notice, the result will include the artist with id #9:8)
    • traverse in_sung_by from (select * from V where @rid=#9:8)
  • Find only songs sung by the artist with id #9:8 (only songs)
    • select * from (traverse in_sung_by from (select * from V where @rid=#9:8)) where type = 'song'
    • This could also have been written as:
      • select expand(set(in('sung_by'))) from #9:8
  • Find authors of songs sung by the artist with the id #9:8
    • select * from ( traverse out_written_by from (select * from ( traverse in_sung_by from ( select * from V where @rid=#9:8 ) ) where type = 'song') ) where type = 'artist'
    • Or more effectively:
      • select expand(set(in('sung_by').out('written_by'))) from #9:8 
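If you want to experiment a little further, here are a couple of extra queries that are not in the video (my own additions, using the same schema):
  • The five most performed songs
    • select name, performances from V where type = 'song' order by performances desc limit 5
  • All songs whose name starts with 'B'
    • select * from V where type = 'song' and name like 'B%'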

Part 3: Creating Our Own Database

In this section, I'll explore how you can create your own database. I'll be using custom classes (unlike the Grateful Dead database, which only used the built-in 'V' class).
[UML model of the Member/Article schema used in this part]
The following video shows how we can create a database that uses the above model as its schema:


Below, I've included the statements required to create the database used in the above video.
create class Member extends V
create property Member.name string
alter property Member.name min 3
create property Member.password string
create property Member.email string
create class Article extends V
create property Article.title string
alter property Article.title min 3
create property Article.content string
create class follows extends E
create class replies extends E
create class authors extends E
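If you want OrientDB to enforce a little more than a minimum length, you can also add constraints and indexes to the schema. The two statements below are my own additions (not shown in the video), using standard OrientDB SQL:

alter property Member.email mandatory true
create index Member.email unique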

Part 4: Populating the Database

In the previous section we defined a database schema. That means we've prepared OrientDB to store, and enforce constraints on, the objects we'll insert into this database. It would be possible to store the same information without the schema (in so-called schema-free mode). However, since we decided to go down the schema-full route, OrientDB can help us enforce the constraints on the data structures.

The video below explores different ways of inserting data into the schema we created in the previous step.

If you don’t want to sit through the video, I’ve included some functions/key script fragments below the video for you to instantiate the database as you please.

Important: For some of the functions that I show in the video to work, you have to add a handler to the orientdb-server-config.xml file. Please add the following XML fragment to orientdb-server-config.xml (this step is covered in the video):
<handler class="com.orientechnologies.orient.graph.handler.OGraphServerHandler">
   <parameters />
</handler>
Why do you need the above step, you may wonder? It ensures that the database handle used in the server-side JavaScript functions is of the graph type.
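For reference, the body of such a server-side JavaScript function might look roughly like the sketch below. This is my own reconstruction from memory of the 1.6-era API, not a verbatim copy from the video, so treat the exact calls (orient.getGraph(), addVertex('class:Member')) as assumptions to verify against the OrientDB documentation:

// Sketch of a server-side function that creates a Member vertex.
// Assumes the graph handler above is configured so that orient.getGraph() works.
var gdb = orient.getGraph();
var member = gdb.addVertex('class:Member');
member.setProperty('name', 'Petter');
member.setProperty('email', 'petter@example.com');
gdb.commit();
return member;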


Some key syntax for inserting data:
create vertex {VERTEX CLASS NAME} set {property} = {value}, …
create edge {EDGE CLASS NAME} from {OUT VERTEX ID} to {IN VERTEX ID}
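For example, to create a member, an article and an 'authors' edge between them (the property values are purely illustrative, and I'm assuming the authors edge points from the Member to the Article; swap the from/to clauses if you modeled it the other way):

create vertex Member set name = 'Petter', email = 'petter@example.com', password = 'secret'
create vertex Article set title = 'My first article', content = 'Hello OrientDB'
create edge authors from (select from Member where name = 'Petter') to (select from Article where title = 'My first article')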

Part 5: Querying the Database (Again)

Now that we have some data (assuming you finished Part 4), let's see if we can define some interesting queries. If you want to try it out without seeing what I do, here are some suggestions for queries to formulate (I sketch possible answers to the first two right after the list):
  • Given a member X:
    • Look up everyone that follows X.
    • Look up all articles that X posted that were eventually replied to.
    • Look up everyone that at one time participated in an article that originated with member X.
  • Find all members that have answered some article posted by someone else.
  • If I changed the content of an article, who would be affected by this (the members that posted articles prior to this article and the members that posted articles as replies to it)?
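If you want to compare notes, here is roughly how the first two could be expressed. These are my own sketches; they assume that follows and authors edges point out from the Member, and that replies edges point from the reply to the original article:

select expand(in('follows')) from Member where name = 'X'
select from (select expand(out('authors')) from Member where name = 'X') where in('replies').size() > 0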


Summary

In this post I've tried to give you a head start on OrientDB. Most of my use of OrientDB in this tutorial has been graph-oriented. If time permits, I'll try to make another tutorial later that explores the power of OrientDB as a document database.

I hope you learned something. Please don't feel shy about leaving comments. I moderate the comments, as I've had a very large number of spam posts, so if your comment doesn't appear immediately, please do not disappear. As soon as I get a chance to make sure the comment is not spam, I'll approve it and try to answer you.

  1. Petter, very nice job. I have been working with OrientDB for a little while now. I made a little custom REST adapter to allow for using Isomorphic's SmartClient framework with Orient. One of my biggest problems now in actually building a database is schema design, coming from relational and going to graph. Not a lot of docs beyond simple examples. I am looking at designing a CRM app with Orient and still struggling with graph schema design, like when a link and linkset come into play vs. just an edge, but your tutorial has helped. Would love to see more on this subject. Nice job!

    Thanks,
    Dan

  2. Excellent tutorial. thanks for posting.

  3. Hi,
    I need help!
    Assume I have this query: select Mail, expand( both('Friend') ) from User where Name='hoang'
    The query returns all fields. I only want to get the fields Name and Mail. How can I do that?
    Thanks,

    Replies
    1. Can't you just wrap a simple query around your other query?

  4. Really excellent tutorial, thanks for taking the time to put this together! I was able to follow along perfectly until 8:14 of Tutorial video #4, when I got the following error:

    "Error on parsing script at position #0: Error on execution of the script Script: createSomeMembers ------^ sun.org.mozilla.javascript.internal.EcmaError: ReferenceError: "gdb" is not defined. (#11) in at line number 11"

    It seems obvious that the reference to the database is broken, but I cannot find anywhere in the OrientDB docs where a "gdb.save()" command exists, and the complexity of SQL vs graph is a bit confusing when searching in the docs for the right spot.

    But in spite of this current roadblock, your tutorial has been really excellent and is very much appreciated! Thank you!

    Replies
      1. Did you set up the graph database handler? I show how to do that in the first video...

      2. The same error occurred when executing the function..

  6. https://groups.google.com/forum/#!topic/orient-database/BOueQD8hMOA

    Any thoughts on this?

  1. Today, I want to introduce the word 'dugnad' (pronounced [ˈdʉːgnɑd]) to my friends and colleagues.
    Dugnad is a word from Old Norse and it is wrongly translated as 'volunteer work' in the English dictionary. Dugnad has a much richer meaning and tradition in Norway. Dugnad is when a community comes together to fix a problem in their community. 
    When I lived in Norway, it was usually used to describe a common effort like cleaning up a shared area in your neighborhood, or perhaps your sports club coming together to improve the sports facility.
    The word, dugnad, is now used in Norway to talk about the shared effort required to fight the coronavirus. 
    Source: Wikipedia. Picture of a Dugnad where a group came together to put down a roof 
    I can't find a US word for dugnad, so I am hereby submitting it to the dictionary for inclusion (the last word I know we managed to sneak into the English dictionary was quisling, so it is time the Norwegian language contributes a positive word).
    I have seen how communities come together in the US as well. A great example is all the healthcare workers who have volunteered to work with coronavirus patients in New York (76,000 of them at the latest count). Another is Jayde Powell, who started Shopping Angels (see https://www.cnn.com/2020/03/17/us/coronavirus-student-volunteers-grocery-shop-elderly-iyw-trnd/index.html).
    In fact, the USA is known for coming together in a crisis. Think of the effort the USA put into fighting the Nazis during the Second World War: in 1939, US airplane production was about 3,000 planes; by the end of the war, the US had produced around 300,000 planes.
    I pledge to start my own dugnads. The first will be to share the lessons learned working remotely. I am one of the fortunate ones that work remotely and have been for the last 12 years. As of now, the virus has not hit me and my company because all our work is already remote.
    I also plan to start free online seminars on various topics that I now teach for various companies and universities.
    I'll post the article on how to work remotely here on my blog. I will also start my online seminars as soon as I can figure out which platform is best suited to handle the load (last time I taught a class online, I had 8,000 students and I'm pretty sure my Zoom subscription doesn't handle that :). 

  2. Introduction

    I just upgraded to a new Dell Precision Ubuntu-based laptop after years of using a MacBook. Although for the most part the transition was smooth, one thing frustrated me beyond anything else. The trackpad on the Dell is placed in a location where I constantly touch it, causing my typing to be inserted at random locations as the mouse pointer moves.
    This blog explains how I solved it.

    Disable the trackpad

    You can disable the trackpad in Ubuntu using the command xinput

    First, you have to find out what device number your trackpad has been assigned. You can simply do this by entering:

    $ xinput list
    ⎡ Virtual core pointer                     id=2 [master pointer  (3)]
    ⎜   ↳ Virtual core XTEST pointer               id=4 [slave  pointer  (2)]
    ⎜   ↳ AlpsPS/2 ALPS GlidePoint                 id=12 [slave  pointer  (2)]
    ⎣ Virtual core keyboard                    id=3 [master keyboard (2)]
        ↳ Virtual core XTEST keyboard              id=5 [slave  keyboard (3)]
        ↳ Power Button                             id=6 [slave  keyboard (3)]
        ↳ Video Bus                                id=7 [slave  keyboard (3)]
        ↳ Power Button                             id=8 [slave  keyboard (3)]
        ↳ Sleep Button                             id=9 [slave  keyboard (3)]
        ↳ Integrated_Webcam_HD                     id=10 [slave  keyboard (3)]
        ↳ AT Translated Set 2 keyboard             id=11 [slave  keyboard (3)]
        ↳ DELL Wireless hotkeys                    id=13 [slave  keyboard (3)]
        ↳ Dell WMI hotkeys                         id=14 [slave  keyboard (3)]
    

    As you can see above, in my case the touchpad is called AlpsPS/2 ALPS GlidePoint and it is assigned the id 12.

    I can now disable the trackpad by simply typing:


    $ xinput --disable 12
    

    To enable the trackpad I can type:


    $ xinput --enable 12
    

    With this, I can enter longer editing sessions with the trackpad disabled and then re-enable it again when I need the mouse pointer. However, doing so from the command line is somewhat of a nuisance, as it requires extra keystrokes (I would have to Alt-Tab to the correct window and type the command).

    Use AutoKey

    AutoKey is a VERY useful tool that allows you to automate many of the tasks that you do (from typing to running complex scripts). You can find more information about AutoKey here.

    To install autokey, look at these instructions

    The cool thing about AutoKey is that you can assign an arbitrary Python program to run when you press some key combination.

    To disable the trackpad I want to run the script:
    system.exec_command("xinput --disable 12")
    

    To re-enable the trackpad, you would want to run this Python script:
    system.exec_command("xinput --enable 12")
    

    Now, all you have to do is to pick a couple of key combinations that you don't use in your other programs. For me, I have assigned Ctrl-F5 and Alt-F5 as the keys to enable and disable the trackpad.

    In AutoKey, simply assign each script snippet to the key of your choice.

    That's it. Now, whenever I want to type, I disable the trackpad by hitting Alt-F5 (of course you can assign whatever keys you want), and when I need my mouse, I re-enable it by hitting Ctrl-F5.
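    If you prefer a single hotkey, you can also let the script toggle the state. The sketch below is my own variant; it assumes that AutoKey's system.exec_command can capture output (getOutput=True) and that your trackpad has id 12 as shown above:

    # Toggle the trackpad (device id 12 from `xinput list`) on or off
    props = system.exec_command("xinput list-props 12", getOutput=True)
    for line in props.splitlines():
        if "Device Enabled" in line:
            if line.strip().endswith("1"):
                system.exec_command("xinput --disable 12")
            else:
                system.exec_command("xinput --enable 12")
            break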

  3.  Introduction

    How do you translate traditional API's into REST? Many applications face this problem: we may already have an implementation that exposes a typical functional API. In the previous post, I discussed some of the essential properties of a REST API. In this post, I'll take a simple functional API and convert it to REST resources.

    A Simple Address Book

    Let’s use a simple example of an address book. Although the example is simple, it is interesting enough to spur discussions that we’ll cover in more details later.

    [UML diagram of the AddressBookService interface]

    The above diagram is in UML. The interface may manifest itself as a set of Ajax calls or perhaps a simple API in some programming language. E.g. as an interface in Java:

    interface AddressBookService {
    
      public void addContact(Contact c);
    
      public Collection<Contact> getAllContacts();
    
      public Contact findById(String contactId);
    
      public Collection<Contact> findByName(String name);
     
      public Collection<Contact> findByAreaCode(String areaCode);

      public void removeContact(String contactId);
     
      public void updateContact(String contactId, Contact contact);
     
      public Call callContact(Contact c);
    
      public Call dialNumber(String digits);
    
    }
    

    We also need some data structures to carry the information (often called DTO's). If we stick to Java as our language, we would have to build some JavaBeans that define the Contact and perhaps also the Call.

    The Challenge

    We want to convert the address book into a set of REST resources. The question is now, what are the resources? 

    [Diagram: the REST API as an adapter between the client's conceptual model and the implementation]

    You can think of the REST API as an adapter that converts the user's conceptual model to that of the implementation. Our language for describing the user's conceptual model is the definition of resources. Our challenge is to find the most logical resources from a user's perspective.

    The Obvious Resource

    The “contact” seems like an obvious resource. The address book has a set of contacts. Can we define all the functionality of the address book service into simple REST calls around a “contact” resource? 

    If we decide to create a resource around contacts (or more correctly, a resource-set 'contacts' with resource-instances such as "John Doe, 512-555-8989"), what functionality do we cover?

    Let's explore this with a simple table where we map REST methods on the contact resource to the original API. Notice that I've used '…' in the URLs. The '…' is worth a new blog post by itself. At PayPal the '…' would be 'https://api.paypal.com/v1/addressbook'. I'll just use '…' to indicate 'some base URL that precedes the resource name'.

    CRUDF | URL | Verb | Original interface
    Create | .../contacts | POST | addContact
    Read | .../contacts/{ID} | GET | findById
    Update | .../contacts/{ID} | PUT | updateContact
    Delete | .../contacts/{ID} | DELETE | removeContact
    Find | .../contacts | GET | getAllContacts, findByName, findByAreaCode
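    To make the mapping concrete, here is what creating a contact might look like on the wire. The JSON field names are my own illustration; the original API does not define the Contact DTO in detail:

    POST .../contacts
    { "name": "John Doe", "phone": "512-555-8989" }

    The response would typically be 201 Created, with the new contact (including its server-assigned id) in the body.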

    As you can see, most of the methods in this interface had straightforward mappings into REST. However, we’re still left with a few methods (and consequently also features) that have not been mapped and do not have an obvious mapping.

    How About the Calls?

    The two methods that we did not map are:

    • callContact
    • dialNumber

    These methods do not have an obvious resource. Or more correctly, they may map to obvious resources in an experienced REST designer’s mind, but it is very likely that if we had several REST experts there would be at least 2 different opinions of the obvious design.

    Here is a set of strategies that you may see:

    • Controller interface
      • Conceptually this would perhaps be seen as a method sent to a contact
      • Assuming I wanted to make a call to one of my contacts with the id 1234, the REST call may look like this:
        • URL: …/contacts/1234/call
        • Verb: POST
        • Body: Empty
    • Separate resource for call
      • For this design, we are asking the clients to create a new resource-instance every time they call
      • We could call the resource “calls”
      • To make a call to a contact, we would POST to the "calls" resource. E.g.:
        • URL: …/calls
        • Verb: POST
        • Body: { “contactId”: 1234 }

    In our example, I would argue that the approach of a separate resource is superior (although I know that may be controversial). The hint to why I think so is that we also have a method for calling an arbitrary number, which can now be easily mapped. I also think this approach is more extensible to possible new requirements such as:

    • All calls to a previous contact
      • URL: …/calls
      • Verb: GET
      • Query parameters: ?contactId=1234
    • All calls to an area code:
      • URL: …/calls
      • Verb: GET
      • Query parameters: ?areaCode=512

    The ultimate judge of which is the better API is your clients. The question you have to ask is: what is the most logical and easy-to-use model for them?

    Other Interesting Discussions

    With the introduction of the “calls” resource, we have mapped all the original features of the original API. There are some other questions that we could have explored that may have led to a different design.

    1. Can a client have more than one address book?
      1. If YES
        1. Can a contact be shared across multiple address books?
          1. If NO
            1. Would it be better to design the contacts as sub-resources of the address book?
              E.g.: …/address-books/12345/contacts/1 
        2. Is a call made in the context of an address book?
        3. Can address books be shared? 
    2. Can we do more than call a contact? E.g., send SMS, email, etc.?
      1. If YES
        1. Are SMS/email/etc. somewhat similar to a call? (In other words, should we perhaps have a more abstract resource called "reach-out" that encompasses calls, emails, SMS, etc.?)

    In a later blog post, I'll discuss how to explore these questions by building domain models. However, for this post I'll keep it simple. The design we have assumes the following:

    • The client has exactly one address book
    • We’re interested in calls as separate resources

    It is important to note that different answers to these questions might have led to very different designs. Also, since REST API's are often exposed to the public, it is important to think about future extensions before you release an API; hence, the questions above should be explored with the future in mind.

    Did We Improve the World for Our Clients?

    All we have done is to map the existing API into new calls. We could easily have exposed the functional API as a set of Ajax calls. However, I would argue that we have gained a few very important advantages by converting the interface to a REST style:

    • We now have a shared model with our clients. 
    • We have introduced consistency into the design: consistency in abstraction level and consistency in how the API works.
    • With proper REST resource design, we also have a clear and predictable path for future extensions of the API.

    Documenting the API

    A great side effect of the consistency of REST API's is that clever open-source developers have provided a set of tools that make it easy to document and expose API's. Although a detailed discussion of these tools is outside the scope of this post, let me briefly mention two of them.

    The first tool is Swagger. Swagger allows you to create a JSON-based specification of your resources and the methods you expose. The specification can be used by the Swagger tools to generate a web interface that acts as the documentation portal for your API. The API portal also allows your clients to try out your API without having to write any code.

    Another tool worth mentioning is RAML. RAML works in a very similar way to Swagger. It arguably has a nicer authoring environment.

    Perhaps in a later blog post I’ll highlight some other options for how to define REST API’s. At SciSpike, we have developed some tools that we are about to release to open source that I believe provide an even nicer authoring environment.

    Guidelines for Converting Functional API’s to REST

    Our example is a little too simple to justify all the guidelines below; I will come back in a later blog post to justify the list:

    1. Functional interfaces can NOT be automatically converted to REST. Expect a complete redesign when moving from traditional API's to REST.
    2. Always reflect your clients' view. The names and the number of resources you create should be largely driven by what your clients want to see.
    3. Interfaces should be durable. Changing an interface may be a big deal and may be very costly.
    4. Interfaces should be extensible. Evaluate what the likely changes to the interface are over time. Can you add features without having to change the existing API?
    5. Interfaces should be well documented. In the REST world, you have no excuse for not providing good API documentation (e.g., with Swagger or RAML).
    6. Build a domain model. Expect some parts of your interface to be easy to translate (those are typically directly related to CRUD(F)), whereas other parts require significant effort to find the correct resources. The missing resources cannot be found without a domain-modeling effort (I'll explore this in my next post).
    7. Avoid controllers. Controller API's are often introduced where another 'event-based' resource would be more extensible.

    Conclusion

    In this blog post, I've discussed how to convert traditional functional API's into REST API's. In a traditional API, there are usually some obvious resource candidates, and the methods on the API can easily be mapped to the CRUD(F) operations on those resources. However, some of the features of your API may be more difficult to map to REST; in later blog posts, we'll explore some of the more difficult resources. One of the great advantages of REST is that, in converting your API, the implementors and the clients of the API come to share a common model, at a consistent level of abstraction and with predictable 'method calls'.


  4. Introduction

    This is the first post in a series on REST. In these posts I'll try to demystify REST: I'll discuss what REST is and give some suggestions for how you can design good REST APIs.

    First, let me acknowledge some of the sources of these posts. I’ve been teaching REST API’s for one of my favorite clients PayPal over the last year. In the process I’ve had the fortune of working with the PayPal services team. The discussions with these individuals and my students have led to a great number of insights on my side. In particular, I’d like to acknowledge Jason Harmon with whom I’ve collaborated on creating the courses that we currently teach at PayPal. Jason’s insight into REST and passion for good API’s are legendary.

    The current plan is to provide a set of blog posts. As I build them, I’ll inject direct links into the below list:

    • What is REST? (covered in this first delivery)
    • Domain driven development of REST APIs
    • Versioning of REST resources
    • An extensive example of how I build REST API’s
    • PATCH vs. PUT
    • Controllers vs. event-sourcing

    Are REST services a new idea?

    I actually once started an article on why REST is not new. However, halfway through the article, I had to stop and admit to myself that there are some quite new ideas in REST. I've seen ideas similar to REST in various books over time. For example, the Cheesman & Daniels book "UML Components" contains many of the same design ideas as those later attributed to REST.

    This is not to discredit Roy Fielding. Fielding suggested the ideas that we now call REST in his PhD thesis, and I think his ideas are brilliant.

    The main new ideas of REST in my opinion are:

    • Services are organized around resources
    • Resources have a unique URL
    • Resources can be manipulated through HTTP
    • The encoding of the resources can be determined by the client
    • The introduction of hypertext (or, more correctly, hypermedia) to the API

    All of these new ideas lead to some really interesting benefits. Probably the most important one is great consistency in the way the API appears to its clients.

    What is REST?

    REST is short for REpresentational State Transfer. Of course, that doesn't really convey it all, so let me give you my elevator pitch:

    REST is an architectural strategy for defining API’s where the clients and service providers share a common information model and the client activation of the services takes the form of requests for lookup, updates or creation of information in the shared information model. The information model takes the form of sets of resources and their relationship. 

    The protocol of communication between the client and the service provider is HTTP, where the resources are identified by a URI, the intent of the query or update of the information model is expressed through HTTP verbs, and the format of the information exchange is defined by the content type. The desired content type(s) can be controlled by the client if the service provider supports more than one type.

    What is a resource?

    Central to the idea of REST is the concept of a resource. To help you understand what resources are, I want to distinguish two different terms, both to make the narrative a little clearer and to allow me to draw parallels for those with an object-oriented background.

    • Resource-Set
      • A resource-set is a collection of resource instances. When communicating with a service provider a client would address the resource set to perform operations such as
        • Create a new resource instance
        • Find a resource instance
      • For those of you with an object-oriented background, imagine the resource-set as a class
      • For those of you that have a relational database background, imagine the resource-set as a table
    •  Resource-Instance
      • A resource-instance represents an individual conceptual object that a client and the server agree upon
      • Our operations on resource-instances are typically
        • Read its state (or information)
        • Update its state (or information)
        • Remove it from our conceptual resource-set
      • For those of you with an object-oriented background, a good analogy would be an instance/object
      • For those of you with a relational database background, think of the resource-instance as a record

    The analogies that I suggested for relational databases or object-oriented technologists may be helpful in the beginning, but you have to eventually make the leap to understand that the resource (set or individuals) are conceptual not physical.

    Let’s illustrate with an examples from various domains:

    • Hospital domain
      • “Patients" is an example of a resource-set
        • A resource-instance is "Joe Doe, a patient admitted to the hospital"
      • “Admissions" is an example of a resource-set
        • “Joe Doe’s admission to the hospital on 2/3-2015 11:01PM CST” is an example of a resource-instance belonging to this set
    • Rental car domain
      • "Rental car agreements” is an example of a resource-set
        • “Sarah Smith’s rental at Hertz At San Francisco Airport on 2/3-2013 9:01 AM PST” is an example of a resource-instance belonging to this set

    The point that the resources are based on a conceptual model is very important. The fact that we have a resource-set called Patients does not necessarily mean we have a table called patient in our implementation. All we are saying is that, as a service provider (e.g., a hospital-admittance server), we understand what you mean by Patient and we have an API that allows clients to manipulate patients. We may, for example, provide a resource-set at the URI http://api.hospitals.com/patients which allows you to look up all our known patient instances (or resource-instances).

    Whether a client is referring to a resource-set or a resource-instance is clear in every service invocation from the URI that it uses. A service provider will publish one URI that refers to the resource-set. A resource-instance is always addressed by adding a unique identifier to the resource-set URI. Let me give you a real-world example. At PayPal we have a resource-set that can be addressed at:

    https://api.paypal.com/v1/wallet/payments

    If I wanted to create new payments or find payments, I would use this URI. However, if I wanted to read a resource-instance, in this case a specific payment, I would address that at:

    https://api.paypal.com/v1/wallet/payments/293759715

    Notice that the URI of the resource-instance is the resource-set URI + "/" + the identifier of the payment.

    The CRUDF idea of REST

    REST suggests that most of what we do as clients of services is manipulate these conceptual resource-instances. In particular, we may Create, Read, Update, Delete and Find instances. Notice that the F is not normally included; it may very well be my own invention. As I was writing this, I looked to see if someone else had used the same term. I did find SCRUD (S for search) and BREAD (Browse, Read, Edit, Add, Delete), so it's not an original thought on my side. I'll just continue using CRUD(F), as I've been using this term when teaching and I believe many of the readers of my blog are ex-students :)

    Another idea in REST is that we identify what we want to do by a combination of URI and HTTP verb. That is, we can clearly see the intention of any client request simply by knowing the URI and the HTTP verb. HTTP supports a set of verbs, and not all of them are used in REST. Most of us are familiar with the difference between GET and POST (GET is typically used to open a page in your browser, POST is often used to submit a form from your browser). In addition to GET and POST, we also use PUT, PATCH and DELETE.

    In theory, that gives us 10 combinations (two dimensions; instance vs set and the 5 verbs). In practice, we only use 6 combinations though.

    Let me show you in a simple table.

    URI (set vs. instance) | HTTP verb | Semantics
    http://URI_OF_RESOURCE (set) | GET | FIND the resource-instances that belong to the specified resource-set
    http://URI_OF_RESOURCE (set) | POST | CREATE a new resource-instance and make it available from the resource-set
    http://URI_OF_RESOURCE/ID (instance) | GET | READ a resource-instance
    http://URI_OF_RESOURCE/ID (instance) | PUT | UPDATE a resource-instance
    http://URI_OF_RESOURCE/ID (instance) | PATCH | UPDATE a resource-instance (an alternative to PUT, which I'll get back to in a later post)
    http://URI_OF_RESOURCE/ID (instance) | DELETE | DELETE a resource-instance

    Notice that this leaves a few combinations unused. Although these combinations are perfectly valid and the semantics of such operations can be defined, they are rarely used. Just for completeness, though, let me list them and explain why they are seldom used:

    • DELETE on a resource-set.
      The semantics of this would be to delete all resource-instances in the resource-set. We don't usually expose this, for two reasons.
      1. The operation is “too dangerous”
      2. It is difficult to handle partial failure. What if we managed to delete a subset of the resource-instances? Of course we can define the behavior and allow this combination, but it is usually not used in API’s I’ve studied
    • PUT on a resource-set.
      The semantics would be to replace the resource-instances in the resource-set with the new list that we are "PUTting". The reasons this is not typically supported are the same as for DELETE on the resource-set.
    • POST on a resource-instance.
      I'm not sure what the semantics of such an operation would be, but I know from teaching that students have suggested using it when the client gets to specify the resource identifier. I think this is a bad idea. If you want to support client-defined resource identifiers, the identifier should be part of the payload (more about that later).

    A simple example

    Let's design a simple ToDo application. Say we want to create a service that allows clients to maintain a list of things to do. The todo items describe what the client wants to do and when each task is due.

    The first thing to do is to try to define the resources. In this example the resource model is very simple. In later blog posts I'll show how to use domain models to find the resources. The only resource here is the list of todo items. The REST API may look something like this:

    • To CREATE a new todo item
      • URI = resource-set
        • E.g.
          • http://api.sample.com/todos 
      • HTTP verb = POST
      • The data of the new todo item will be passed to the service in the body of the request
    • To READ a specific todo item
      • URI = resource-instance
        • E.g.:
          • http://api.sample.com/todos/123
          • 123 is the resource-instance identifier
      • HTTP verb = GET
      • The data of the requested todo item will be returned in the body of the response
    • To UPDATE a specific todo item
      • URI = resource-instance
      • HTTP verb = PUT
      • The updated todo item will be passed to the service in the body of the request
    • To DELETE a specific todo item
      • URI = resource-instance
      • HTTP verb = DELETE
    • To FIND a set of todo items
      • URI = resource-set
      • HTTP verb = GET
      • The set of resources will be returned to the client in the response body

    Implementation of RESTful Service

    There are some great technologies available that make the creation of a RESTful service quite trivial; which one to explore depends on the programming language/platform you work on.
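    As an illustration of how little code such a framework requires, here is a minimal sketch of the todo service in Python using Flask. The choice of Flask is my own for illustration; any of the popular REST frameworks would look similar:

    # Minimal sketch of the todo resource from the example above, using Flask.
    from flask import Flask, request, jsonify, abort

    app = Flask(__name__)
    todos = {}      # in-memory store: id -> todo item
    next_id = 1

    @app.route('/todos', methods=['POST'])                   # CREATE
    def create_todo():
        global next_id
        todo = request.get_json()
        todo['id'] = next_id
        todos[next_id] = todo
        next_id += 1
        return jsonify(todo), 201

    @app.route('/todos', methods=['GET'])                    # FIND
    def find_todos():
        return jsonify(list(todos.values()))

    @app.route('/todos/<int:todo_id>', methods=['GET'])      # READ
    def read_todo(todo_id):
        if todo_id not in todos:
            abort(404)
        return jsonify(todos[todo_id])

    @app.route('/todos/<int:todo_id>', methods=['PUT'])      # UPDATE
    def update_todo(todo_id):
        if todo_id not in todos:
            abort(404)
        todo = request.get_json()
        todo['id'] = todo_id
        todos[todo_id] = todo
        return jsonify(todo)

    @app.route('/todos/<int:todo_id>', methods=['DELETE'])   # DELETE
    def delete_todo(todo_id):
        todos.pop(todo_id, None)
        return '', 204

    if __name__ == '__main__':
        app.run()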

    Conclusion

    In this article, I've highlighted some of the basic principles of REST. In the next article, I'll take a simple example of a functional API and design REST resources for it.


  5. Introduction

    Waterfall, agile, iterative: whatever your process, your chance of success can be measured by the number of solid feedback loops between the developers and the domain experts.
    In this article I'll argue that, no matter your process, you have to find ways to insert frequent checkpoints. I'll argue that agile development makes this easier, but I'll also give you an insight into how we do this at SciSpike. At SciSpike we've developed a platform that allows us to go through iterations in minutes by use of gamification, and it is our belief that such approaches will become commonplace in the future.

    What’s the Problem?

    Semantic Gap
    The picture above illustrates our inherent problem. The systems we’re building are improving a domain. However, the builders are not necessarily experts on this domain.
    Say for instance we’re building a remote patient monitoring application. The physicians understand the domain and know what they want to monitor. However, they don’t have the ability to convert their knowledge into running systems and must rely on programmers to do so.
    The programmers don’t understand the domain and have to rely on the physicians to explain it to them. Their language and abstraction level are dramatically different.
    A typical sign of such a semantic gap is a mismatch between what has been built and what the business expected. As late as 2012, McKinsey & Company and Oxford University did a survey showing that, on average, IT projects are 45% over budget and deliver 56% less value than predicted. Also, 17% of IT projects fail in a way that threatens the existence of the commissioning company.
    We typically get conversations like the ones below:
    Example of semantic gap
    Sounds familiar??? I’m sure it does.

    Business Analysts...

    Business Analysts
    The chance for miscommunication is further increased by the introduction of a 3rd party to ‘negotiate’ the requirements and write them up in some functional specification. This person, let’s call him/her the business analyst, is there to help the communication between the domain expert and the implementation team.
    Experienced business analysts with a great deal of domain knowledge may provide an essential service. Unfortunately, more often than not, the introduction of an intermediary increases the chances of miscommunication. This is quite natural and a well-researched phenomenon: something is always lost in the translation from business language to design language.
    The loss of information in the communication between the domain experts and the developers is increased by the use of natural language. The problem with natural language is that it is nearly impossible to have two people read a natural-language specification and arrive at the same understanding. This problem has been addressed in other places (see references below), and I'll take it as a given in this article.

    Can We Solve Our Communication Problem with Models?

    We have had elaborate ideas of building models that allow the users and programmers to communicate at a high level of abstraction. We have been pointing to other engineering disciplines where this approach has been successful.
    We may hear things like: “Look, I’m building a house right now. The architect has provided me this floor plan and this façade model… Although we haven’t started building the house yet, we’re communicating with the builders. Why can’t you replicate this for software?”
    Some have suggested that we already have such tools and modeling languages that allow us to communicate. In most cases, they are referring to CASE tools that allow us to render some of our requirements in abstract notations such as UML.
    It's my experience that such models can be very useful, but only if the parties communicating understand the notation and are able to validate the models. However, I've yet to find a project where the domain experts are fluent in some abstract software modeling notation!
    The fallacy of the modeling approach for software is that it is very difficult for domain experts to translate the notation into requirements they understand. Most people can easily take an architectural plan and visualize what the building will look like when it's done; I've yet to find business people that can do the same with a UML model.

    When Are Errors Detected?

    Errors are rarely detected before the end users or domain experts see the software in action. It’s when they use the software or see it demoed for the first time that they understand how it works. For complex systems, hours of use may be required before they discover missing requirements or incorrectly implemented requirements.

    No Waterfall Then?

    Waterfall is expensive because it tends to increase the time between when requirements are discussed and when the resulting system is verified. There are still business domains where waterfall may be appropriate, however, waterfall almost always results in increased risk and project cost.
    Government organizations seem to still prefer waterfall approaches and there are some practical reasons why they do. However, in almost all other business domains, waterfall approaches are to be avoided.

    Is Agile the Answer?

    Agile methodologies focus on increased communication between the domain experts and the developers. In particular, most agile methodologies promote frequent demos of the application. I typically see demos scheduled with regular intervals (once per week to once per month is common).
    The demos improve communication and allow the domain experts to make course corrections early. In well-functioning agile teams, we see that they rarely implement more than "one iteration's worth" of code before they get verification of correctness through demos.
    So, agile development is clearly an improvement over waterfall, but is it good enough? I’ll argue that even agile development can be improved. I see two major issues with agile:
    • An iteration-long delay before verification is still too long (in particular in typical Scrum projects, where the iteration length is 4 weeks).
    • In many cases, it is impractical to get the domain experts together at regular intervals. They may have to be flown in from various places, or they may simply not have the time to come. In most demos I see, the domain experts are glaringly absent…

    A Better Approach

    At SciSpike, we’ve developed a special platform for requirements gathering. We’ve created a language that allows us to specify the application Intent. The Intent language allows us to hold requirement-gathering sessions where the specification is tested in real-time using executable software that we generate from the Intent language. This yields some very substantial benefits:
    • Shorten the iteration cycle (we’re aiming at minutes)
    • Provide formality to the requirements that minimizes the potential for misunderstanding.
    • Generate documentation that the parties can review and verify after the requirements sessions.
    The platform is also used to generate partial implementation (including business logic and a simple UI)  that further ensures that the software development progresses in the desired direction (more of that in another blog post, I’ll just focus on the requirements gathering in this post). We call this an Executable Requirements Specification.
    The approach I'll discuss below yields a very large number of meaningful iterations even in short requirements sessions. The iterations may be just a few minutes long. The course corrections discovered in our minute-long iterations typically take week-long iterations to discover in traditional agile projects.

    The Intent Languages

    We have two main languages that we use to describe Intent:
    • Conversation Language. This is a language for describing collaborative behavior. At first glance it may look similar to workflow languages, but there are some important semantic differences. We are using a computing model based on Carl Hewitt's actor model (we've changed the semantics of actors quite a bit, so we call it an agent model instead).
    • Domain Modeling Language. This language allows us to specify conceptual information models. This language is very similar to the kind of models you can create with a UML class model.
    We decided to keep our languages textual, although we have mapped (and generate) graphical views of the textual languages. We decided not to build a graphical editor because:
    • The speed of editing (much faster when textual). 
    • The advantage of using textual merge/diff tools. It is now easy to support concurrent updates of the models.
    • We want users of the tools to be able to create models without having to install elaborate tools/IDE’s. Ideally, we want developers of the Intent language to only need a simple text editor.

    Gathering Requirements

    To have an effective requirements session, we need the following roles present:
    • Intent language recorder. This is a person fluent in our Intent languages who records decisions and makes modifications to the Intent model.
    • Moderator. Someone driving the requirements session.
    • Domain expert(s). Someone that fully understands the domain.
    • Product owner. Someone able to make decisions as to what the product we’re building is supposed to support.
    We often find people that can play several of the roles above. In particular, SciSpike usually provides people that can play the roles of both Moderator and Intent Language Recorder.
    We run the requirements sessions in the following way:
    Mini Iterations
    Each iteration may literally take only a few minutes. In a few hours, we would typically have gone through more iterations than most projects do in years!

    Typical Requirement Capture

    Most of you are probably used to capturing requirements in one of the following three forms:
    • Use-cases (perhaps supported by UML diagrams)
    • User stories
    • Functional specification
    Although each of the forms has a long history of being used to capture requirements, I would suggest that all three forms are suboptimal.
    Use-cases (often supported by UML models) are a semi-formal form that tries to strike a compromise between being precise and being readable by the various stakeholders. I have definitely seen some well-defined use-cases, but most of the ones I see suffer from two fatal flaws:
    • They are ambiguous. I can almost always create a simple questionnaire (usually with multiple-choice questions), ask the creators/reviewers of the use-case to read through the use-case, and finally administer the questionnaire. Usually, I get an almost random distribution of answers across the questionnaire.
    • They are cumbersome to construct. I’ve witnessed business analysts spending weeks to define even the simplest use-case.
    User stories may work out great when the domain experts are readily available. They are somewhat of a reaction to the formality of use-cases. However, in most cases, all they do is defer the discussion until later. They act as a simple hook for us to remember a feature, but the details have to be worked out later. In many cases, the domain experts are not available when the user stories are being expanded, and we often waste multiple iterations in which the developers implement what they think the stakeholders want, demo it to learn the "real requirement", and finally reimplement it to fulfill the actual requirement.

    Example

    Let me show an example of how such a requirement session may work. I think it will be very difficult to explain this in text, so I have made a small video that may help illustrate how it all works.
    The below videos are the same. I've just made it available both on YouTube and on Vimeo. I've had some bad luck with YouTube where they sometimes shorten the videos, so I decided to make it available on Vimeo also
    Vimeo

    YouTube

    Alternatives
    Unless you have also built a tool like ours, you may now ask: how can I do what you showed? Being selfish, I would say, contact SciSpike and something can be arranged…
    However, if you don’t want to go down that route, there are other alternatives that you can apply to decrease the time required for a single iteration.
    The main thing to focus on is:
    • How do you capture requirements and are those requirements meaningful and precise enough for all the stakeholders?
    I’ve seen very meaningful sessions using tools like storyboarding, manual role-plays, screen mockup tools, etc. Try to gamify the requirements gathering. Make the specifications as executable as possible.

    Conclusion

    In this article I argued that the cost and risk of software projects are inversely proportional to the length of each meaningful iteration. By a meaningful iteration, I mean an interval in which the stakeholders explain some features, the development organization captures the requirements for those features, and the stakeholders verify that the development organization has understood them.
    In a typical software organization, the length of an iteration varies from 1 week to multiple months. We at SciSpike have developed a platform that allows us to shorten the iteration length to minutes. It is our experience that this dramatically reduces risk and shortens the duration of a software project.


  6. Introduction

    Over the last 10 years I’ve seen a change in the way the top engineers/architects design, architect and deploy software. This new generation of computer scientists has its eyes firmly focused on quality attributes such as availability, scalability and fail-over capabilities; issues that are rarely discussed in most enterprises.

    Many companies tell me they have moved to web-scale and deploy to the cloud, however, when I study their architecture in detail I notice that they do not architect the software in a way that allows for scale. For some of them, my warnings come too late when their success catches up to them and their systems are not able to cope with the increased traffic.

    In this article, I’ll explain why the way we traditionally architect software systems doesn’t scale and suggest some approaches for how to improve the scaling of your software.

    Moore’s Law

    Gordon E. Moore, the cofounder of Intel Corporation, observed that the number of transistors in a dense IC doubles approximately every 2 years (later adjusted to 18 months). As software developers we were made to look like heroes by the improvements in hardware. The software we built ran twice as fast a couple of years later.

    The software speedup trend no longer holds for most large-scale software systems. Only those who write software in a way that can take advantage of the hardware improvements continue the trend of Moore’s law. This excellent illustration (I’m not sure exactly what the original source is; it is flagged as sharable in Google Images, so I’m simply linking to one of the places where it exists) shows us why we can no longer rely fully on the improvement of hardware.

    Notice how we keep adding more transistors (fulfilling Moore’s law). The computers are continuing the trend of becoming more and more powerful. However, the method of improvement has changed. Clock speed is no longer the main source of the improved performance (that trend stopped around 2004). Today, we improve performance by adding more cores. 

    In addition to the increasing number of cores on individual machines, we also are trending towards computing on a set of machines (mainframes are not quite dead yet, but even those are often constructed using a grid of computing units).

    Here is the bad news for us software developers:

    Unless you change the way you write code, you will not be able to take advantage of the hardware improvements. 

    I think Herb Sutter put it best

     “The Free Lunch is Over!!!” - Herb Sutter (Dr. Dobb’s Journal, 2005)

    Herb Sutter goes on to say:

    "Concurrency is the next major revolution in how we write software"

    Amdahl’s law

    To understand why Mr. Sutter predicted that “the Free Lunch is over”, we have to take a look at what has been named Amdahl’s law (sometimes also called Amdahl’s argument, named after Gene Amdahl and a presentation from 1967!). Amdahl’s law states that the maximum performance increase achievable by throwing more CPUs at a problem is inversely proportional to the percentage of the code that runs sequentially.

    The graph above (source: Wikipedia) shows a rather trivial relationship. If the percentage of your code that has to run sequentially is 5% (0.05), then the maximum speedup of your system when adding more processors is 1/0.05 = 20X. If 25% of your software is sequential, your maximum performance gain from adding cores is 4X.
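
    To make the relationship concrete, here is a small JavaScript sketch of the formula behind the graph (the classic statement of Amdahl’s law); the numbers simply restate the 20X and 4X ceilings mentioned above:

      // Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
      // where s is the fraction of the program that must run sequentially
      // and n is the number of processors.
      function amdahlSpeedup(sequentialFraction, processors) {
        return 1 / (sequentialFraction + (1 - sequentialFraction) / processors);
      }

      console.log(amdahlSpeedup(0.05, 1000));     // ~19.6 with 1,000 processors
      console.log(amdahlSpeedup(0.05, 1000000));  // approaches the 20X ceiling
      console.log(amdahlSpeedup(0.25, 1000000));  // approaches the 4X ceiling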

    The Amdahl problem in traditional enterprise architecture

    Most traditional enterprise architecture runs primarily sequential algorithms. In fact, I would argue that over the last 20+ years we have deliberately been constructing languages, platforms and frameworks that promote sequential programming. 

    The main reason for us to promote sequential programming has been that it is (arguably) significantly easier on the programmers. We optimized for quality attributes such as readability, maintainability, ease of programming, etc.

    In a sequential code, it is very easy to understand the state of the program/algorithm. We know that the list of statements is executed one-after-the-other. If the program counter is on statement 4, we know that statement 3 has been completed and that statement 5 will be executed next (after 4 is complete). 

    We also constructed various ways of locking resources so that the programmer was guaranteed exclusive access and did not have to worry about interference from other programs. This locking was typically achieved with locking semaphores or using transactions (e.g., SQL transactions). 

    The ease of programming, we thought, led to fewer bugs, easier-to-maintain code, etc. We were taught not to optimize if it made the code harder to read. The argument was often:

    “If it doesn’t run fast enough, we can always throw more hardware at it”

    Amdahl’s law shows that this approach simply doesn’t work. At some point (quite an early point with current algorithms), you’ll run into the max potential scale. Perhaps this was not an issue for you. If you are building software that services less than 100 concurrent users, you probably never had to worry about this. However, if you intend for your software to be used by thousands (or perhaps millions) of concurrent users, you eventually have to change the way you construct your software.

    In some of our projects, we have to support several million concurrent users/transactions. If we built the software the traditional way, we would simply not be able to keep up, no matter how much hardware we threw at the problem.

    Good for us, there are known ways to construct software that promote parallelism and hence increase our chances to scale. We’ll take a look at a couple of interesting approaches later in this article, but before I discuss them, perhaps we should first focus on what kinds of synchronization/sequential algorithms we try to avoid.

    Resource synchronization

    Most enterprise software takes advantage of lightweight threads to ensure some degree of parallelism. This allows a programmer to start several sequential programs at the same time. Unfortunately, these programs rarely achieve orthogonality. In particular, they often share resources. The good news is that we already have invented ways for the multiple programs to collaborate when accessing the resources.

    Unfortunately, the most common form of collaboration is to sequentialize the access to the resource. That is, if more than one parallel task needs to read/update the same resource, we ‘queue up’ the tasks and only allow one of them to read/update the resource at any point in time. While one task is accessing the resource, the other tasks are suspended, waiting for the accessing task to complete.

    Sequential to improve readability

    Let’s take a look at a typical algorithm in enterprise software:

    log("we are about to open the database");
    connection = db.getConnection();

    A compiler will ensure that the two statements above are executed in sequence; however, they could really have been executed in parallel. Most likely, there is no need for the program to wait for the log statement to complete before opening the database.

    If one studies a typical sequential algorithm, there are often lots of opportunities for parallelism. However, the style/rules of how the program is constructed often prevent natural parallelism.
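
    To illustrate, here is one way the two statements above could be allowed to overlap in JavaScript. The promise-returning logAsync and getConnectionAsync below are hypothetical stand-ins, not a real driver API; the point is only that independent steps need not wait for each other:

      // Hypothetical asynchronous stand-ins for the two calls above.
      const logAsync = (msg) =>
        new Promise(resolve => setTimeout(() => { console.log(msg); resolve(); }, 10));
      const db = {
        getConnectionAsync: () =>
          new Promise(resolve => setTimeout(() => resolve({ connected: true }), 10))
      };

      // The log write and the connection setup now run concurrently.
      async function setup() {
        const [, connection] = await Promise.all([
          logAsync('we are about to open the database'),
          db.getConnectionAsync()
        ]);
        return connection;
      }

      setup().then(connection => console.log('ready', connection));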

    Concurrency and parallelism

    I’ve read different books giving completely different descriptions of concurrency and parallelism. Books that focus on low-level execution tend to specify the difference as:

    Concurrency is when two threads of execution can make independent progress (but do not necessarily execute at the same time). Parallelism is when the two threads of execution make progress at the same time.

    If you read a book on design or architecture, you may have read:

    Concurrency is when two programs incidentally work separately but may interfere when accessing resources. Parallelism occurs when two programs were designed to work at the same time with no interference.

    I’m going to focus on parallelism in the design/architecture form. That is, I’ll focus on deliberately designed programs that can run independently.

    Actor model

    One famous computation model that promotes parallelism is the actor model. The actor model is a mathematical model for concurrent computation, but its ideas have been actualized in several programming environments. Perhaps the most famous such environment has been the concurrency model of the Erlang programming language.

    The actor model is inherently concurrent. From Wikipedia:

    An actor is a computational entity that, in response to a message it receives, can concurrently:

    - send a finite number of messages to other actors

    - create a finite number of new actors

    - designate the behavior to be used for the next message it receives.

    There is no assumed sequence to the above actions and they could be carried out in parallel.

    Perhaps in a later blog article I’ll go through the actor model in detail; however, the linked wiki article is quite good and I’ll refer to it for further details.

    There have been many critics of the actor model. The most typical complaint is that it is difficult (some say impossible) to compose actors. I actually strongly disagree with this statement; I think the composition is simply different from what people are used to. I believe that when designing with actors, one naturally designs clusters of actors that collaborate on some task, and that a careful architect/designer will compose the actors into what I call conversations. However, that is a discussion for another article.

    Another criticism (which I do agree with) is that the programming model is more difficult and that only a subset of the programmers out there will master this paradigm. In particular, programmers seem to have problems proving the correctness of their program when some required behavior requires collaboration between a set of actors.
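
    To make the idea a little more tangible, here is a toy actor in JavaScript (runnable in Node.js). It is only a sketch of the concept (private state, a mailbox, one message processed at a time), not how Erlang or any actor framework actually implements it:

      // A toy actor: private state, a mailbox, and messages handled one at a time.
      function makeActor(initialState, handler) {
        let state = initialState;
        const mailbox = [];
        let scheduled = false;

        function drain() {
          scheduled = false;
          while (mailbox.length > 0) {
            const message = mailbox.shift();
            state = handler(state, message); // the handler returns the next state
          }
        }

        return {
          send(message) {
            mailbox.push(message);
            if (!scheduled) {       // schedule processing asynchronously so that
              scheduled = true;     // senders never block on the receiver
              setImmediate(drain);
            }
          }
        };
      }

      // Usage: a counter actor. No locks are needed because only the actor
      // itself ever touches its own state.
      const counter = makeActor(0, (count, message) => {
        const next = message === 'increment' ? count + 1 : count;
        console.log('count is now', next);
        return next;
      });
      counter.send('increment');
      counter.send('increment'); // prints 1, then 2, one message at a time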

    Lambda architecture

    A quickly emerging architectural pattern is called “Lambda Architecture”. Personally, I find the name quite misleading, but trying to be compliant with the rest of the world, I’ll use the term in this article. (The “Lambda” comes from the idea of using Lambda Calculus; lambda is the Greek letter that Church used to represent a functional abstraction… a long derivation and, in my opinion, a suboptimal name.)

    In a Lambda Architecture, the processing of incoming data is delayed as long as possible. This is achieved by storing immutable facts and processing them at a later time. The facts are often simple time-series data. Some of the facts have to be processed in real time, but in most applications, a surprisingly small subset of the facts requires real-time attention.

    Let me try to illustrate with an example:

    Say we are building a banking system for online trading (e.g., PayPal, Google Wallet). We’re bringing together payers and payees. It is important to us to ensure that we know what the account balances are in real time. However, the individual transactions and all details around them (e.g., where was it performed, what did they buy, what browser did they use, ...) can be processed at our convenience (or at least several seconds/minutes/hours or perhaps even days later).

    Using the Lambda Architecture, we would separate out the real-time data (the account balances) into a data store with extreme performance (e.g., some distributed memory model) from all other data. When a transaction is performed, the balance is checked and updated based on some real-time data store.

    The non-real-time data would be stored as facts in a specialized distributed database. What we typically store is the incoming event with a timestamp and all the available knowledge of the event. It may be as simple as a log file, but it could be more sophisticated, using a specialized time-series database (our (SciSpike’s) default architecture uses Cassandra, which has proven to be perfect for this purpose). Since the facts are stored and never updated, this storage doesn’t require typical locking and we only strive for eventual consistency (as opposed to real-time consistency).

    At some later time, we then compute specialized views from the sequence of facts. Since the views are computed from the immutable set of facts, we can parallelize the generation of these views with ease (e.g., using separate map-reduce tasks).
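
    Here is a minimal sketch of that idea in JavaScript, assuming an in-memory fact log (in a real system the log might live in Cassandra or plain log files, as described above); the account names and amounts are made up for illustration:

      const facts = []; // immutable, append-only events (never updated in place)

      function recordTransaction(accountId, amount) {
        facts.push({ accountId, amount, timestamp: Date.now() });
      }

      // A batch view computed later, at our convenience, from the full fact log.
      function computeBalances() {
        return facts.reduce((balances, fact) => {
          balances[fact.accountId] = (balances[fact.accountId] || 0) + fact.amount;
          return balances;
        }, {});
      }

      recordTransaction('alice', 100);
      recordTransaction('alice', -30);
      console.log(computeBalances()); // { alice: 70 }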

    Popular programming environments

    Today we have several frameworks that promote parallelism. I’ll focus on a couple of them and discuss why they make it easy to achieve parallelism.

    Non-blocking and asynchronous programming environments

    One popular new(-ish) technology is Node.js. Why is it that Node.js has become popular so quickly? There are many reasons:

    • The popularity of JavaScript
    • The ease of getting started
    • The anti-framework initiatives

    Although the above properties of Node.js may be the most commonly cited, I’m going to focus on one aspect that is related to the topic at hand, namely how Node.js promotes parallelism.

    Many of you may be terrified to hear that Node.js uses a single thread! It’s like we went back to the old Windows environment where a single event-dispatcher thread processed all incoming events. I’m sure some of you can remember the hourglass and total lack of responsiveness when someone decided to do something that took time on the event dispatcher’s thread. So how can this improve parallelism?

    Node.js achieves its parallelism by using non-blocking libraries. What we mean by that is that when we call one of the library functions, the call returns immediately and the result is delivered later through non-blocking, asynchronous communication (Node.js runs on the Google V8 engine, which implements a complete set of non-blocking functions).

    So, for example, say we want to read some data from a console. In node this would be done this way:

    console.read( function(err, data) { /* process the incoming data */ } );

    The code here is quite different from most other environments, where the supplied I/O functions would suspend the calling thread until the input was collected from the user. In Node.js, we are ‘sending a message’ to the console asking it to read some input, and we’re passing along a function that will handle the next steps when the data has been collected. It may very well be that the Google V8 engine uses a separate thread to read the user input from the console; however, as a developer I’ve delegated this problem to the V8 engine.

    When programmers use this asynchronous messaging style, the algorithm is often naturally concurrent.
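
    If you want to see the difference in a runnable form, Node’s file-system module offers both styles (the file name below is just a placeholder):

      const fs = require('fs');

      // Blocking style: the thread waits here until the whole file has been read.
      const config = fs.readFileSync('config.json', 'utf8');

      // Non-blocking style: the call returns immediately and the callback runs
      // later; in the meantime the single thread is free to do other work.
      fs.readFile('config.json', 'utf8', (err, data) => {
        if (err) throw err;
        console.log('config loaded,', data.length, 'characters');
      });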

    Other languages have introduced similar concepts.

    • Java
      •  Introduced non-blocking I/O from version 1.4. 
      • JEE 6 had partial support for non-blocking I/O and JEE 7 will have full support. 
      • The new concurrency library provides many features for supporting algorithmic parallelism (e.g., futures)
      • Libraries are starting to take advantage of this (e.g., AsyncLogger in Log4j, Vert.x)
    • Scala
      • Very strong support for parallelism
      • Akka framework (also available for Java programmers)
    • C#/.NET
    • Python

    Conclusion

    To achieve scalability you may have to change the way you construct software. A very important change is to maximize parallelism. Other architectural tricks include delayed and orthogonal processing of time-series events (also called the Lambda Architecture).
    If you have not already done so, it is time to learn and take advantage of the new programming paradigms. It makes a huge difference in your ability to scale and take full advantage of the processors in your deployed system.
     
     

    References

    Moore, Gordon E. “Cramming More Components onto Integrated Circuits.” Electronics, Volume 38, Number 8, April 19, 1965.


  7. Introduction

    Over the last few years I've been experiencing conflicts between pure agile development and the need for budgeting. At SciSpike we do everything agile and it works for us. We have a good idea of the velocity of our teams and can often be very accurate in our estimates.

    Often, when working on client projects, we don't have metrics or heuristics that we can use to help us with the estimation. We may also be working with unknown resources (either the client outsources to 3rd parties or the project has yet to be staffed). In extreme cases, we may also be asked to estimate before knowing what technology to use.

    Eventually, no matter the uncertainty, we will be asked this question:
    How much is this going to cost and when will it be done?
    One answer that is not acceptable is simply: I don't know, no matter how true it is!
    In this article I'll discuss how we may better communicate our estimates and the inherent uncertainty within the estimates.

    Why don't I know how long it will take?

    You may ask: You have 30 years of experience building software and you still cannot estimate how long it would take to build something? Perhaps I'm in the wrong profession?

    Published research (Sackman, Erickson, and Grant, 1968; Curtis 1981; Curtis 1985; Mills, 1983; DeMarco and Lister, 1985; Card, 1987; Valett and McGarry, 1989; Oram and Wilson, 2011) establishes a difference in productivity between the most and least productive software teams of 10X (that is, the most productive team is 10 times more effective than the least productive team).

    I know the 10X figure has been questioned. For example, in [Boehm 81], the research originally showed an uncertainty of ~4, but in later writing (Boehm, 00) his estimates went up to nearly 6. I can see that the factor of 6 may be more accurate in some places (particularly when you average out more predictable tasks, such as deployment, maintenance, training, etc.). Having said that, I have also been privy to some research performed by one of our clients where the difference in performance was measured to be greater than 100X. Whether you want to use 4X or 100X, the points of this blog are still valid.

    Let's use 10X for now. Think about it for a second. Let’s say I just came off a project where I was working with the top team. I have a good idea of what it would take to build some system. I estimate it based on this experience, but it turns out I get the least productive team and my estimates are off by a factor of 10! Time to pack my stuff!

    Let’s say we have the opposite experience. We worked with the slowest team, estimated the project based on that experience, and the project is cancelled because of cost/time; we may have lost a great opportunity. I'll keep my job, but my stock options are not as good as they could be.

    Bottom line: unless you have a stable team with an established velocity, we don't really know how fast we'll be able to develop. This is often referred to as the cone of uncertainty (in software we often call it Boehm’s cone of uncertainty):
    Cone of uncertainty
    In addition to the uncertainty in team velocity, there is also an uncertainty in project planning:
    • Did we cover all the tasks?
    • Do we really know the complexity of the tasks?
    In addition, Murphy’s Law is never far around the corner.

    The need of the business

    If you are a project manager, scrum master or product owner, try to put yourself in the shoes of a CFO or CEO. You have discovered a business opportunity that requires development of some new software. You know how to calculate the upside of the opportunity (I've actually built quite a few of these evaluation models). To make a business decision that resolves whether to pursue the opportunity, you need to know:
    • When will the project be ready (or in other words, when can I take advantage of the opportunity)?
    • How much is it going to cost to get it operational and run the system after it has become operational (contributing to the downside of the opportunity)?
    These are important inputs to any such evaluation model.

    Say you are the CTO or Director of Software Development. The CFO/CEO will turn to you and ask for these numbers. I recommend that you don’t use the argument: "I don’t know; we have to build it before we’ll know". You may just find yourself packing your artifacts by the end of the day.

    So what’s the secret sauce?

    Remember the third item in the agile manifesto?
    Customer collaboration over contract negotiation
    What we need is mutual collaboration. Both the upper management and the development organization have to have mutual understanding of expectation and risk. Without this understanding, the following seems to reoccur in the software organizations I’ve studied:
    • The C-level is dominant. They will require the estimates and hold the responsible estimators' feet to the fire.
    • The estimators use pessimistic numbers to ensure their career is not in any danger (we’ve seen padding by factors of 4-10).
    • Opportunities are sometimes lost because the cost of the project was overestimated.
    • If the project starts, the budget will be used (Parkinson’s Law) and money is wasted.

    How to express uncertainty?

    I suggest that the estimate should be a graph rather than a number. The graph should show a cost/time curve based on probabilities. I often hear project managers say it will take 6 months and cost $2 million even before having a team or an established architecture. I would argue they are simply guessing. Also, anyone who is constantly 100% accurate is most likely pessimistic, has caused loss of opportunities and is wasting money (Parkinson’s Law). The only way to be 100% accurate is to allow a margin of error that is detrimental to the business’s ability to pursue opportunities.

    If an estimator has to give a single number, the best ones will be wrong most of the time. However, the times they are too optimistic should balance the times they are too pessimistic.

    The estimation graph

    I’m not the first to suggest that estimates should be represented as a graph. In fact this was imprinted in me from my studies of practical operational research back in my university days.

    It seems most methods suggest that the probability curve follows some kind of normal distribution. Others argue it may follow a Parabola, Trapezoid, Parr or Rayleigh curve (Basili, 1981). I’m going to use a normal distribution curve for simplicity here (we are so imprecise here anyway that whichever curve is better would make little difference, I think).

    Although the discussion of the best way to create such a curve is outside the scope of this blog article (I’d rather focus on the communication between the project team and upper management), I’ll show you one way you can produce such a normal distribution curve using what’s often called “three-point estimation”. I’m actually not sure what the origin of this method is, but here is how you do it:
    1. List all the tasks that you have to perform
    2. For each task estimate three numbers:
      1. The most optimistic time (ot)
      2. The most realistic time (rt)
      3. The most pessimistic time (pt)
    3. Assume that the mean time can be found by the formula:
      (ot + 4*rt + pt)/6
    4. Finally, we calculate the standard deviation using the following formula:
      (pt-ot)/6
    Now that we have both the mean time and the standard deviation, we can draw a normal distribution curve and we can somewhat reason over the numbers.
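
    Here is a small JavaScript sketch of the calculation. One assumption that is mine, not part of the method above: to aggregate the tasks I sum the means and combine the standard deviations as the square root of the summed variances (i.e., I treat the task estimates as independent); the task numbers are made up:

      function taskEstimate(ot, rt, pt) {
        return {
          mean: (ot + 4 * rt + pt) / 6,
          sd: (pt - ot) / 6
        };
      }

      function projectEstimate(tasks) {
        const estimates = tasks.map(t => taskEstimate(t.ot, t.rt, t.pt));
        const mean = estimates.reduce((sum, e) => sum + e.mean, 0);
        const sd = Math.sqrt(estimates.reduce((sum, e) => sum + e.sd * e.sd, 0));
        return {
          p50: mean,           // ~50% chance of beating this number
          p80: mean + sd,      // roughly 80% (mean + 1 standard deviation)
          p95: mean + 2 * sd   // roughly 95% (mean + 2 standard deviations)
        };
      }

      console.log(projectEstimate([
        { ot: 2, rt: 4, pt: 10 },
        { ot: 1, rt: 3, pt: 8 },
        { ot: 3, rt: 5, pt: 12 }
      ]));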

    Reading and evolving the distribution curve

    First, we have to be sure that we understand how to read the graph. As you probably remember, there is some science to this. Assuming we made honest estimates, we’ll have a 50% chance of exceeding the mean time formed by the curve (the peak of the curve). If you want me to have an 80% (83.6% to be accurate, but we’ll keep to round numbers) chance of beating my estimate, we should set the date to the mean (peak) + 1 standard deviation. If you want roughly 95% certainty, add two standard deviations.
    To get good numbers you’ll have to:
    • Try to find ALL the tasks. There are several tricks for this, but to enumerate and explain them would require several blog posts and it is not the point of this article.
    • Break the work into tasks that are of appropriate sizes (my suggestion is no larger than 1/10th and no smaller than 1/100th of the total effort)
    As the project starts, we can use agile measuring techniques to start narrowing the cone of uncertainty. As we establish a team velocity and get a well-established backlog, I would expect the distribution curve to narrow and the estimate to be more and more precise.

    A simple example

    Let’s say we have a project with 10 identified tasks (or stories, or features, or function points, or whatever you want to call them). To stay general, I’ll name the tasks A..J.
    Let’s assume our best guess estimates are something like:
    Example Project
    This would suggest that we are 50% sure we’ll deliver in 44 units (say days or $1,000). We are 80% sure we can deliver in 50 units and 95% sure we can deliver in 56 units.

    You may protest here (in fact, you should protest here). Remember we said that the difference between the best and the worst performing team is around 10X. Since we don’t know whether we’ll get the most productive or the least productive team, we should have a factor of 10 between the optimistic and pessimistic estimates! Although I can usually make a better guess than that, let’s for the sake of argument make the difference between optimistic and pessimistic 10:
    Example10X
    Notice that even if I kept the realistic (R) and optimistic (O) constant, the numbers are now quite terrifying. Now, what we are suggesting is that our 50% chance estimate is 86 units, our 80% estimate is 124 units and that our 95% estimate is 182 units.

    Even though this is still way more accurate than what the empirical research suggests, I’ve found that the numbers seem to work out OK when the project is of sufficient length that it is possible to improve the skills of the team during the project (either through resource swaps or training).

    Summary

    When starting a new project in a new context, we typically don’t know what resources are required to complete the project. This does not diminish the need for business leaders to get reasonable estimates from the technologists. I suggest that we need to be more transparent in our communication of estimates and ensure that the uncertainties in the numbers are well understood and accepted by both parties. I’ve seen opportunities lost because the estimates were too pessimistic, and I’ve seen well-meaning leaders fired because the estimates were too optimistic.

    I suggest that the uncertainty can be communicated with a simple graph. I showed how to use a three-point estimate to produce such a graph. I would prefer to use more scientific methods, but this method is easy to use and communicate to the business. This method does not reduce the need for the estimation methods suggested by agile methodologies. Using backlogs and burndown charts is still recommended. As you learn the velocity of your team, the accuracy of the estimates will improve and as they improve, the new estimates should be presented back to the business.


    References and further reading

    • Basili, Victor and Beane, John. 1981. “Can the Parr Curve Help with Manpower Distribution and Resource Estimation Problems?” Journal of Systems and Software 2, 59-69.
    • Boehm, Barry W., and Philip N. Papaccio. 1988. "Understanding and Controlling Software Costs." IEEE Transactions on Software Engineering SE-14, no. 10 (October): 1462-77.
    • Boehm, Barry, 1981. Software Engineering Economics, Boston, Mass.: Addison Wesley, 1981.
    • Boehm, Barry, et al, 2000. Software Cost Estimation with Cocomo II, Boston, Mass.: Addison Wesley, 2000.
    • Boehm, Barry W., T. E. Gray, and T. Seewaldt. 1984. "Prototyping Versus Specifying: A Multiproject Experiment." IEEE Transactions on Software Engineering SE-10, no. 3 (May): 290-303. Also in Jones 1986b.
    • Card, David N. 1987. "A Software Technology Evaluation Program." Information and Software Technology 29, no. 6 (July/August): 291-300.
    • Curtis, Bill. 1981. "Substantiating Programmer Variability." Proceedings of the IEEE 69, no. 7: 846.
    • Curtis, Bill, et al. 1986. "Software Psychology: The Need for an Interdisciplinary Program." Proceedings of the IEEE 74, no. 8: 1092-1106.
    • DeMarco, Tom, and Timothy Lister. 1985. "Programmer Performance and the Effects of the Workplace." Proceedings of the 8th International Conference on Software Engineering. Washington, D.C.: IEEE Computer Society Press, 268-72.
    • DeMarco, Tom and Timothy Lister, 1999. Peopleware: Productive Projects and Teams, 2d Ed. New York: Dorset House, 1999.
    • Mills, Harlan D. 1983. Software Productivity. Boston, Mass.: Little, Brown.
    • Sackman, H., W.J. Erikson, and E. E. Grant. 1968. "Exploratory Experimental Studies Comparing Online and Offline Programming Performance." Communications of the ACM 11, no. 1 (January): 3-11.
    • Sheil, B. A. 1981. "The Psychological Study of Programming," Computing Surveys, Vol. 13. No. 1, March 1981.
    • Valett, J., and F. E. McGarry. 1989. "A Summary of Software Measurement Experiences in the Software Engineering Laboratory." Journal of Systems and Software 9, no. 2 (February): 137-48.

  8. In my previous blog post I praised the open-source database Orient DB. I’ve received many email from people since asking me questions about Orient DB or telling me that they’d love to try Orient DB out but don’t know where to start.

    I though it is time for me to contribute to the Orient DB community and write a short tutorial for exactly this audience. Ideally, I'd like to ensure that if you sit through the videos below, you'd at least have an informed view of whether Orient DB is of interest to you, your project and your organization.

    Think of this blog post as the first installment. I may come back with some more advanced tutorials later. This tutorial shows you how to:
    • Download and install Orient DB
    • Play with the Orient DB studio and perform queries on the Grateful Dead database provided with the community edition
    • Create your own database (again using Orient DB studio, perhaps later I’ll provide a tutorial to show how to do this from Node or Java)
    • Create server-side functions
    • Perform graph- and document-based queries over Orient DB
    I hope this tutorial gives you a good idea of how Orient DB works and some of its power. I also hope it will help illustrate the reasons for my enthusiasm in the previous post where I argued the case for Orient DB. In this post I’ll focus on how you get started.

    Introduction (or, Read Me First!)

    I had some of my guys review the videos for this tutorial and they pointed out that I assumed that the viewer/reader knew something about graphs. Although this may be a bad assumption, I don’t want to patronize you.

    This is my attempt at giving you a short introduction to the concepts of graphs and the vocabulary I'll use in the tutorial.

    Orient DB is not only a graph database. It supports all (or nearly all) the features of a document database and an object-oriented database. Hence, discussing only graph theory may be a bit misleading. However, since I’m using primarily the Graph API of Orient DB in this tutorial, it may be prudent of me to ensure you know something about graphs.
    Here is what you need to know.
    The above picture is taken from the Wikipedia article on Graph Databases. The illustration shows 3 objects (as circles) that we keep track of. There are also lines (or arrows) that define relationship between the objects. We call an object vertex and a relationship edge.

    I do have some issues with the graph. They always show two edges that only differs based on the direction they are read. I would have preferred a single line (or a single edge). However, since it is likely that you would lookup Graph Databases in Wikipedia, I thought it would be better if I showed their example and criticized it (OK, I will go and suggest another drawing for Wikipedia. It is on my to do list. Honestly!)

    A graph (for the purpose of this tutorial) consists of:
    • Vertices. A vertex is a cluster of information. The information is typically stored in key-value pairs (aka properties). You may think of it as a map (or I’ve notice that many developers call these maps hash tables, however, I find this misleading as hash is a particular strategy for arranging keys in one of the map implementations).
      One way of seeing a vertex is as an identifiable set of properties, where a property is defined as a key-value pair. Unique to Orient DB, the properties can be nested to form complex data structures. I will not do any nesting in this tutorial. I do, however, plan to write another tutorial where I explore Orient DB as a document database.
    • Edges. An edge is a link between two vertices. Edges may also have properties. Edges have have direction. That is, they start in one vertex and end in another vertex (actually, to be correct, they may also end in the same vertex. Such edge has the cute name ‘buckle’ in graph theory).
    That’s it! Really.

    Most of you probably have some background in relational databases, so let me give you a comparison between OrientDB and typical relational databases. The comparison will be slightly misleading no matter how I present it. To minimize the confusion, I’ve decided to also add a column that discusses how the two technologies differ.
    Graph DB | Relational DB | Differences
    Vertex | Row | A row in an RDB is a flat set of properties. In Orient DB a vertex may be an arbitrarily complex structure. If you are familiar with document databases, it is basically a document.
    Edge | Relationship | A relationship in a relational database is not a first-class citizen. The relationship is made ad hoc by use of joins on keys. In Orient DB (as in all graph databases) the edge (relationship) is a first-class citizen. This means it can have an id, have properties, etc.
    You may ask: where is the comparison of schema constructs such as table, column, trigger, stored procedure, etc.? I think the tutorial does discuss these constructs, so I’ll delay the discussion until later in this tutorial.

    In the second tutorial, I will repeat this short theory session. To help you progress from this incomplete theory session (and for your reading pleasure), I’ve provided a few links below:
    I decided to use Orient DB Studio in the videos. It is probably the least likely tool that you’ll use to access the database. Most developers’ prime exposure to Orient DB is through language bindings or libraries. I was thinking of selecting one of these environments and showing how Orient DB is used there. The problem is that if I picked one of the environments, I would risk alienating the ones using other environments. I think that no matter what environment you're using, you would at some point bring up Orient DB Studio, but perhaps I’m wrong and I just alienated everyone. I hope not!
    If time permits, I’ll come back with separate tutorials for how to use Orient DB from Java/Scala/Python/Node.js/...

    Part 1: Download and Install Orient DB

    I’ve prepared a simple video that explains how to download and install Orient DB. I’m on a Mac, hence the video shows how to install it on Macs. I do, however, think it should be easy to transpose the steps to Windows or Linux.

    At the time I created this video, the latest version of Orient DB was 1.6.2 (things are moving rapidly at OrientDB, and I noticed they already released 1.6.3 before I got a chance to publish the blog). Orient DB releases many versions per year, so there is a risk that at the time you read this tutorial, my instructions may be slightly out of date. All the steps that I followed in the video can be found on the Orient Technologies website and I’m sure it will be kept up to date. Hence, if the version you downloaded is different from 1.6.2, you may be better off following the up-to-date instructions on the Orient DB sites.

    A bit of a warning: towards the end I do execute a few queries to ensure that everything works. Don't despair if you don't understand what I'm doing. All I wanted to do was ensure that I'd succeeded in installing Orient DB.

     

    Part 2: Play Around with the Grateful Dead Graph

    Orient DB comes with an example database instance. The provided database contains a graph describing performances by the Grateful Dead. The graph is sparse and the information model is quite simple. This is unfortunate, as it is hard to illustrate all the power of Orient DB with this graph. Perhaps at a later time I’ll try to find an open-source dataset that has more interesting information (suggestions welcomed!).
    The Grateful Dead graph can be illustrated with the following UML model:
    NewImage
    The model above is as truthful as I could make it. I did exercise some artistic 'enhancements' to make the model easier to read. Here are my embellishments:
    • Notice the attribute on the ‘followed_by’ association. Various versions of UML have provided different ways of depicting attributes on relationships, so let me explain what I mean by my way of using UML. The example database stores a count of the number of times a song has been followed by another song in a concert (I did not create the model, so I’m providing my own interpretation). The weight is a count of the number of times. To illustrate this in UML I have to show properties on associations.
      Conceptually, this is not a problem for a graph database because it can store properties on edges. It may, however, be awkward to see the mapping to other technologies.
    • I decided to expand the enumerated type specifying the "song_type". A more common way to model this would be to use an enum type. However, I think ‘display-wise’ expanding the enum is better here, as it is only used in one place.
    • Another ‘cheat’ is the introduction of classes. The example database does not actually use custom classes (all the objects are instances of a class called ‘V’, short for vertex). It does, however, provide a property called ‘type’ that is used for the purpose of typing. I believe the model I created is better at illustrating the original intent of the model.
    I would assume only some of the readers are familiar with the Grateful Dead, so let me give you a few sentences about them. The Grateful Dead is a cult rock band that started playing in California in the 60’s. I assume the graph is populated from the same data set that produced this illustration:
     
    It is really not important who the Grateful Dead were and whether you like them or not. Just think of them as some rock group that performed a set of songs very many times. Someone cared enough to capture which songs they performed, which song followed another song, who wrote each song, and who sang it.

    With that introduction, let’s move on to the video where I create a set of queries and navigate around the example database.
    Just for your reference, here are some of the queries I used:
    • Select all vertices (or objects) in the database
      • select * from V
    • Select the vertex with the id #9:8
      • select * from V where @rid=#9:8
      • This could also be written as:
        • select * from #9:8
    • Select all the artists
      • select * from V where type='artist'
    • Select all the songs that have been performed more than 10 times
      • select * from V where type='song' and performances > 10
    • Count all songs
      • select count(*) from V where type='song'
    • Count all artists
      • select count(*) from V where type='artist'
    • Find all songs sung by the artist with the id #9:8 (notice, the result will include the artist with id #9:8)
      • traverse in_sung_by from (select * from V where @rid=#9:8)
    • Find only songs sung by the artist with id #9:8 (only songs)
      • select * from (traverse in_sung_by from (select * from V where @rid=#9:8)) where type='song'
      • This could also have been written as:
        • select expand(set(in('sung_by'))) from #9:8
    • Find authors of songs sung by the artist with the id #9:8
      • select * from ( traverse out_written_by from (select * from ( traverse in_sung_by from ( select * from V where @rid=#9:8 ) ) where type='song') ) where type='artist'
      • Or more efficiently:
        • select expand(set(in('sung_by').out('written_by'))) from #9:8

    Creating Our Own Database

    In this section, I'll explore how you can create your own database. I'll be using custom classes in this tutorial (unlike the Grateful Dead database, which only used the built-in 'V' class).
    NewImage
    The following video shows how we can create a database that uses the above model as its schema:


    Below, I've included the statements required to create the database used in the above video.
    create class Member extends V
    create property Member.name string
    alter property Member.name min 3
    create property Member.password string
    create property Member.email string
    create class Article extends V
    create property Article.title string
    alter property Article.title min 3
    create class follows extends E
    create class replies extends E
    create class authors extends E

    Populating the database

    In the previous section we defined a database schema. That means that we’ve prepared Orient DB to store and enforce constraints on the objects we’ll insert into this database. It would be possible to store the same information without the schema (in so-called schema-free mode). However, since we decided to go down the route of schema-full use, Orient DB can help us enforce the constraints on the data structures.

    The video below explores different ways of inserting data into the schema we created in the previous step.

    If you don’t want to sit through the video, I’ve included some functions/key script fragments below the video for you to instantiate the database as you please.

    Important: For some of the functions that I show in the video to work, you have to add a handler to the orientdb-server-config.xml file. Please add the following XML fragment to that file (this step is covered in the video):
    <handler class="com.orientechnologies.orient.graph.handler.OGraphServerHandler">
       <parameters />
    </handler>
    Why do you need the above step, you may wonder? It is to ensure that the default handler used in the JavaScript functions is of the graph type.


    Some key syntax for inserting data:
    create vertex {VERTEX CLASS NAME} set {property name} = {value}, …
    create edge {EDGE CLASS NAME} from {OUT VERTEX ID} to {IN VERTEX ID}

    Part 5: Querying the Database (Again)

    Now that we have some data (assuming you finished step 4), let’s see if we can define some interesting queries. If you want to try it out without seeing what I do, here are some suggestions for queries to formulate:
    • Given a member X:
      • Lookup everyone that follows X.
      • Lookup all articles that X posted that were eventually replied to.
      • Lookup everyone that at one time participated in an article that originated with a member X.
    • Find all members that have answered some article posted by someone else.
    • If I changed the content of an article, who would be affected by this (the members that posted articles prior to this article and the members posting articles as replies to this article)?


    Summary

    In this post I’ve tried to give you a head start on Orient DB. Most of my use of Orient DB in this tutorial has been graph-oriented. If time permits, I’ll try to make another tutorial later that exposes the power of Orient DB as a document database.

    I hope you learned something. Please don’t feel shy about leaving comments. I actually moderate the comments, as I’ve had a very large number of spam posts. If your comment doesn’t appear immediately, please do not disappear. As soon as I get a chance to make sure the comment is not spam, I’ll allow it and try to answer you.
    Comments

    1. Petter very nice job, I have been working with OrientDb for a little while now. I made a little custom rest adapter to allow for using isomorphics smartclient framework with Orient. One of my biggest problems now in actually building a database is schema design coming from a relational and going to graph. Not alot of docs beyond simple examples. I am looking at designing a CRM app with Orient and still struggling with graph schema,design like with graph when does link and linkset come into play vs just an edge, but you tutorial has helped. Would love to see more on this subject, Nice job!

      Thanks,
      Dan

    2. Excellent tutorial. thanks for posting.

    3. Hi,
      I need for help!
      Assume I have query : select Mail, expand( both('Friend') ) from User where Name='hoang'
      Query return all field. I only want to get field Name, Mail. How can i do that.
      Thank,

      1. Can't you just wrap a simple query around your other query?

    4. Really excellent tutorial, thanks for taking the time to put this together! I was able to follow along perfectly until 8:14 of Tutorial video #4, when I got the following error:

      "Error on parsing script at position #0: Error on execution of the script Script: createSomeMembers ------^ sun.org.mozilla.javascript.internal.EcmaError: ReferenceError: "gdb" is not defined. (#11) in at line number 11"

      It seems obvious that the reference to the database is broken, but I cannot find anywhere in the OrientDB docs where a "gdb.save()" command exists, and the complexity of SQL vs graph is a bit confusing when searching in the docs for the right spot.

      But in spite of this current roadblock, your tutorial has been really excellent and is very much appreciated! Thank you!

      1. Did you setup the graphical database? I show you how to do that in the first video...

      2. same error occured when executing the function..

    5. This comment has been removed by the author.

    6. https://groups.google.com/forum/#!topic/orient-database/BOueQD8hMOA

      Any thought on this ?

    Every now and then you come across open source projects that just amaze you. OrientDB is one of these projects.

    I’ve always assumed that I’d have to use a polyglot persistence model in complex applications. I’d use a graph database if I want to traverse the information, I’d use a document database when I want schema-less complex structures, and the list goes on.

    OrientDB seems to have it all though. It is kind of the Swiss army knife of databases, but unlike a Swiss army knife, each of the tools is best of breed.

    I’ve had a few experiences with applications built on OrientDB and have also been spending some time testing and evaluating the database. I keep thinking back to projects that I’ve implemented in the past and wishing I’d had OrientDB at my disposal, asking questions such as:
    • Would it be a viable candidate to replace the database we used?
    • How would I have changed the architecture if I did use OrientDB?
    • What would the impact of OrientDB be on factors such as:
      • Elegance of implementation
      • Cost of development
      • Scalability
      • Availability
      • Flexibility/mutability
      • and so on…
    In this article I’ll explain what OrientDB is (from my perspective), why it may be hard to classify and some scenarios of how it could be used.

    What is OrientDB?

    OrientDB is a tool capable of defining, persisting, retrieving and traversing information. I want to start there, rather than saying it is an XXX-type database. This is because OrientDB can be used in multiple ways. It can play a document database (making it a competitor to MongoDB, CouchDB, etc.), it can be a graph database (making it a competitor to Neo4J, Titan, etc.) and it can be an object-oriented database. And it can play all those roles at the same time.

    OrientDB as a Document Database

    Let’s look at OrientDB from the perspective of a document database. OrientDB can store documents (documents here being a nested set of name-value pairs). Perhaps you’re familiar with MongoDB or CouchDB? If so, OrientDB can take an arbitrary document (e.g., a JSON document) and store it. After it has been stored you can query it using path expressions, as you would expect from any document database.

    If you’ve ever worked with document databases, you may sometimes have come across the need to store links. I see this all the time. Say we used a document database somewhere and some of the team members have experience with relational databases. When they discover the primitive support for links, we’ll have long discussions about normalization and how document databases are different, etc. I can usually convince the members that the document database is a better solution, but the truth is… I kind of miss my relationships.

    OrientDB as a Graph Database

    Talking about relationships, the ultimate in handling relationships is, as you probably know, graph databases. Graph databases typically implement relationships as first-class citizens called edges (first-class citizens as opposed to relational databases, which use keys/foreign keys). Edges connect vertices. A vertex, in most graph databases, is a simple cluster of name-value pairs.

    Now, imagine each document in the document database as a vertex. Is that possible? OrientDB has done exactly that. Instead of each node being a flat set of properties, it can be a complete document (with nested properties).

    OrientDB as an Object-Oriented Database

    Why do we create documents? What do they represent?
    In most cases I would think that each document represents some conceptual object. Think of it. What does each of your documents in a document database represent? Perhaps it represented a company, a person or a transaction? I would say, more generically, it probably represented an object. Also, in document databases, we most often type these documents. That is, there is a class of documents that follow the same set of rules.

    How about graphs? I would suggest that here also each vertex typically maps to some conceptual object.
    In OrientDB the vertex and the document are superimposed, and here too it is interesting to think of the document/vertex as an object, with the rules for objects playing a role similar to classes. So, let’s assume we want to impose rules on the data structures of each related object, as in an object-oriented system; what advantages could we obtain?
    1. We would have a guarantee that the objects conformed to some rules we defined
    2. It would be easier to query the objects because they at least named the properties the same
    3. Perhaps we could use relationships between rules for structures as in object-oriented systems (often called inheritance) to organize the rules.
    OrientDB allows you to define classes that the objects (vertices or documents) must conform to. It is probably necessary for me to point out that OrientDB does not force you to do so. You can run in strict schema mode (all objects are typed and must conform to the class definitions), in a hybrid mode (all objects must AT LEAST conform to the rules of the classes but may add any other properties not specified in the classes) or in schema-less mode.

    I can hear the skeptics here…. Sure, we’ve seen this with some of the relational database vendors also. When they got scared of objects, they introduced something that looked like classes, but when we studied it closer it was missing important things like polymorphism (e.g., you could define a hierarchy with Pet as a superclass and Cat and Dog as subclasses, but the database would not understand queries like “give me all pets”; you would have to ask “give me all dogs and cats”). However, in OrientDB this works too!

    Come on! There has to be a Catch?

    Perhaps it doesn’t scale? Maybe it doesn’t perform? This is too good to be true!
    I’ll keep looking and if I find something I’ll post it. The two questions above were where I thought I’d find the issues.

    Scaling

    OrientDB scales. Really, it truly scales. It seems to have a much better strategy than its competitors. It’s hard to know exactly who to compare it to… do I compare it to the document databases or the graph databases? I decided to look at both categories.

    I’ve yet to test this out in a large project. However, at least on paper, the master-master replication, the multi-cluster support, etc. makes me very optimistic with respect to scaling.

    Performance

    It is very hard to find performance numbers that compare databases. I did actually see some tests from a university in Japan where they compared performance numbers for the various graph databases. OrientDB in this test outperformed the competitors by a wide margin. But since the numbers are from older versions of both OrientDB and the competitor tools, I’m not sure how much weight to give the test. The first time we use it on a client application where the client allows me to publish numbers, I promise to share them. One client did allow me to at least say that, from their numbers, OrientDB still outperformed their competitors and, what was more interesting (to me), it also outperformed one of the leading relational databases in some non-traversal scenarios (we know graph databases are fast when we look up a vertex/document and start navigating from there; however, I would have thought it would not be able to compete on queries such as “select * from Person where firstName like '%Petter%'”).

    Use Cases for OrientDB

    I would think that almost anywhere you build a canonical information model to store the state of the system, OrientDB would be a good choice. I’m not sure it is the best choice for time-series databases (perhaps a database such as Cassandra would have an edge here); however, for most traditional domain models, it should work well.

    Traditional Domain Model Implementations

    For most systems, we build out a domain model (or logical information model) that describes what information the system must maintain. Because RAM is more expensive than many other storage forms, and because RAM has a tendency to lose its state when the power goes off, we want to ensure that this information is stored on a disk somewhere so that it can be put back into RAM when needed.

    The state of the art for building such models is to build an object-oriented class diagram, typically in UML (I would say this could be argued. There are some better alternatives here. For instance Express/Express-G and Clafer are perhaps better languages, however, UML is more readily adopted, so… UML it is…). This model defines classes with their properties and associations between classes. With most databases we’ll experience some impedance mismatch when mapping the canonical model:
    • Relational Databases
      • No support for inheritance. Need different strategies such as single table inheritance, table per class, etc.
      • Complex properties typically require their own table. It is now unclear if the table represents an object with or without individuality (or at least the distinction is lost when looking at the tables).
      • No support for polymorphism.
      • Relationships have to be mapped into key/foreign-keys.
    • Graph Databases
      • No support for inheritance (although, with some clever engineering, one can model the metadata hierarchy as vertices and edges and get pretty close).
      • Complex properties introduce new vertices even though we don’t really need to link to them (no individuality).
      • No support for polymorphism.
    • Document Databases
      • No support for inheritance (although it is easy to simulate)
      • No (or limited) support for relationships
    In OrientDB the mapping pretty much eliminates all impedance mismatch (see the sketch after this list):
    • An object becomes a vertex
    • Complex properties can easily be handled as documents
    • Explicit support for relationships
    • Understands typing (when using the schema model), which means
      • Polymorphism
      • Strict enforcement of constraints
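
    As a rough illustration, here is roughly what that mapping might look like in OrientDB SQL. The class and property names (Person, Employee, Company, WorksFor) are invented for this sketch, and the exact syntax for embedded values can vary slightly between OrientDB versions:

      CREATE CLASS Person EXTENDS V
      CREATE PROPERTY Person.firstName STRING
      CREATE PROPERTY Person.address EMBEDDED
      ALTER PROPERTY Person.firstName MANDATORY true

      CREATE CLASS Employee EXTENDS Person
      CREATE CLASS Company EXTENDS V
      CREATE CLASS WorksFor EXTENDS E

      CREATE VERTEX Company SET name = 'Acme'
      CREATE VERTEX Employee SET firstName = 'John', address = {"street": "Main Street", "city": "Oslo"}

      CREATE EDGE WorksFor FROM (SELECT FROM Employee WHERE firstName = 'John') TO (SELECT FROM Company WHERE name = 'Acme')

      SELECT FROM Person

    Notice how each bullet above shows up directly: Employee extends Person (inheritance), the address is stored as an embedded document on the vertex (no extra table or vertex needed), WorksFor is an explicit edge, the MANDATORY constraint is enforced by the schema, and the final query is polymorphic, returning employees as well as plain persons.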

    Reflective Systems

    Off the top of my head, I can think of at least 20 projects I’ve worked on that needed some degree of run-time configuration, where users could configure the rules for the data structures being stored. For those of you who have never worked on these kinds of systems, you may not fully appreciate the complexity such demands introduce into the implementation.

    Perhaps you have developed a web-form before that collected some data from your users. You knew what kind of data structure to obtain and you simply defined a form that was capable of collecting such information.

    Now I want you to imagine a system where you are selling customers the capability to define their own forms, and it is your task to build the system that lets the customer define the forms and then collect and store the information entered into them. What would such a system look like?
    If you have already built such a system, you probably know that you’ll end up with two distinct kinds of data:
    • Metadata
      • Defines the rules for the data structures
      • E.g.
        • Form A has to include a string called Social Security Number and it must conform to some regex pattern
    • Instance data
      • Defines the actual data collected, plus links to the metadata that establish the semantics
      • E.g.
        • User John’s instance of a form where there is a string property with the value 123-45-6789 that was collected as a social security number (the link to the metadata)
    Now, I want you to imagine the relational database schema behind such an application and the complexity in the joins that retrieve this data. Not much fun!
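
    To see why, here is a hypothetical sketch of the kind of relational schema such a system tends to end up with (the table and column names are made up for illustration); every single value is one or more joins away from its meaning:

      CREATE TABLE field_definition (
        id INT PRIMARY KEY,
        form_id INT,
        name VARCHAR(100),
        pattern VARCHAR(200)
      );

      CREATE TABLE form_submission (
        id INT PRIMARY KEY,
        form_id INT,
        user_name VARCHAR(100)
      );

      CREATE TABLE field_value (
        id INT PRIMARY KEY,
        submission_id INT,
        field_definition_id INT,  -- the link from instance data to metadata
        string_value VARCHAR(500)
      );

      -- "What is John's social security number?" already requires two joins:
      SELECT v.string_value
      FROM field_value v
      JOIN field_definition d ON v.field_definition_id = d.id
      JOIN form_submission s ON v.submission_id = s.id
      WHERE s.user_name = 'John' AND d.name = 'Social Security Number';

    And this is before you add typed values (numbers, dates and booleans each needing their own column or table), multi-valued fields, or nested structures.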

    A schema-free document database could at least simplify this problem (you would still have to do some nasty coding to match up the metadata with the instance data, but the data model would be elegant and quite explicit). Similarly, a graph database can accommodate such a problem quite easily.
    Here again, OrientDB shines, by:
    • Being a document database (you can store any document you want on a vertex)
    • Being a graph database (you can simply introduce new edges and properties, remember properties are simple name-value pairs)
    • Allowing for schemas to be introduced at runtime, and hence enforcing many of the rules for you (see the sketch after this list)
      • In many cases, this could be your metadata!
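
    Here is a rough sketch of what that could look like in OrientDB SQL. The class names (FieldDefinition, FieldValue, DefinedBy, CustomerForm) are made up for this example, and in a real system the CREATE/ALTER statements would be issued at runtime by your application as the customer defines a form:

      CREATE CLASS FieldDefinition EXTENDS V
      CREATE CLASS FieldValue EXTENDS V
      CREATE CLASS DefinedBy EXTENDS E

      CREATE VERTEX FieldDefinition SET name = 'Social Security Number', pattern = '[0-9]{3}-[0-9]{2}-[0-9]{4}'
      CREATE VERTEX FieldValue SET value = '123-45-6789', userName = 'John'
      CREATE EDGE DefinedBy FROM (SELECT FROM FieldValue WHERE userName = 'John') TO (SELECT FROM FieldDefinition WHERE name = 'Social Security Number')

      CREATE CLASS CustomerForm EXTENDS V
      CREATE PROPERTY CustomerForm.ssn STRING
      ALTER PROPERTY CustomerForm.ssn MANDATORY true
      ALTER PROPERTY CustomerForm.ssn REGEXP '[0-9]{3}-[0-9]{2}-[0-9]{4}'

    The first half models metadata and instance data explicitly as vertices linked by an edge; the second half shows the alternative mentioned in the last bullet, where a class created at runtime becomes the metadata itself and OrientDB enforces the mandatory and regular-expression constraints for you.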

    Summary

    This has been a shameless tribute to the brains and muscles behind OrientDB, the most versatile database I’ve run across. May the future bring you fame, fortune and happiness!

    P.S. I decided after I created this post that it may be time for me to contribute to OrientDB. Here is a getting-started tutorial.

  10. Introduction

    In preparation for a course I'm teaching over the next couple of weeks on OO and Scrum, I spent the weekend reading up on the new books on Agile and Scrum to see if there are points that I need to add to my courses.

    During my preparation I read a few books and papers discussing distributed Scrum. They contained good tips, but I felt they missed a few important points. My company takes part in quite a few distributed projects, so I thought I'd share some of our key lessons.

    Also, I thought I'd root these lessons in the agile manifesto. This blog post focuses on the challenges distribution introduces to the agile goals and how we overcome them. I'm not going to try to explain agile, XP or Scrum. If these words are unknown to you, you probably have to read up on those topics before coming back to this post.

    The agile manifesto reads:
    • Individuals and interactions over processes and tools
    • Working software over comprehensive documentation
    • Customer collaboration over contract negotiation
    • Responding to change over following a plan

    Individuals and interactions over processes and tools

    Distribution often makes this manifesto item hard to follow. In particular, it is hard for individuals to collaborate when they are not co-located. Agile teams often prefer to work in the same room (often called a bullpen), where the cost of collaboration is minimized. In a distributed world, you can't just lean over to the person with the knowledge you're missing and get the answer immediately.

    It is somewhat ironic that we've overcome this obstacle primarily with tools, but I think you'll quickly realize that the tools I'm referring to are different from the tools originally targeted by the manifesto.

    Collaboration tools

    There are many ways to collaborate online these days: Skype, MSN, Yahoo, WebEx, TeamViewer, NetMeeting and Google (Hangout, documents, chat, phone, etc.) to mention a few. These are all good tools, but they still don't come close to the feeling of being in the same office.

    At SciSpike we're using a tool called Sococo. Although we could probably make our collaboration work with any of the other tools, Sococo seems superior for agile work. In Sococo, the agile team is given a virtual office that may look something like the picture below (the picture was stolen from the Sococo site, but I'm sure they don't mind :)):

    There are several ways to configure the office, but the idea is that everyone has their own office and there are a set of meeting rooms. Each person is represented as an avatar, and I can at any time see who is in the office and who's currently discussing/meeting/...

    So… say we have a meeting in one of the meeting rooms; it may look something like this (again, the picture is stolen from their site…):

    The above picture tells me that Berry, Kent, Mary and I are meeting. Kent is presenting on two screens, Benny on one screen, and I am also presenting on a screen (see the little bubble next to the screen). I'm not going to go on explaining Sococo, but notice how I can lower my cost of communication. For the most common scenarios, the process of initiating a collaboration looks like this:
    • Say I want to talk to Jim and Jim is in his office:
      • Double click on Jim's office
      • Start to talk
    • Say I want to have a meeting with Jim and Mary
      • Right-click on Jim and Mary and invite them
      • Jim and Mary click on a popup and they automatically enter my office
      • We start to talk
    • Daily scrum
      • Set a time
      • Everyone enters the office (you'll see who is there)
      • Go around the room (each avatar lights up when that person is talking, so you can easily see who's talking)
    I encourage you to try out the tool for yourself, but the bottom line is that Sococo makes it as easy (or in many cases easier) to interact with your coworkers remotely as it would be in person.

    Although the Sococo tool works perfectly for most communication, we also use other communication tools. We've found Google Hangout and Skype with screen sharing to work better when two developers want to collaborate (the screen sharing in Sococo doesn't expose the mouse, which makes it hard to 'point' to locations on the screen). For pair programming, TeamViewer is really good.
    UPDATE: With the Sococo Version 2.3.0 released July 2013, the screen sharing has been improved dramatically. We are now also using Sococo for pair programming!

    Distributed pair programming

    One effective agile practice that helps spread knowledge and aid collaboration (disputed perhaps, but effective) is pair programming. We've found that this can be performed quite effectively remotely. The screen sharing programs are quite effective here. In fact, it may even be better than sitting together, as both developers can watch the screen from a comfortable position.

    We don't use Sococo for this, as the screen sharing there is not quite responsive enough, but tools like Skype, Google Hangout and TeamViewer work perfectly for this.

    We use pair programming judiciously. We find it to be very effective for communicating, teaching, working through critical parts of the design and a few other obvious scenarios. However, in our particular case, our engineers have been picked based on their strength as individuals. They are quite comfortable taking on tasks by themselves and collaborating selectively when required.

    Hire engineers who can work distributed

    SciSpike's model is entirely distributed. We hire the best engineers we can find no matter where they may be located. However, one of our criteria is that they are comfortable and thrive in a distributed environment. Learning how to collaborate effectively across boundaries like distributed locations is not for everyone. An important part of our interview process is to find out if someone can collaborate well in an online setting.

    Use a test driven approach

    I would recommend using techniques from the Test Driven Development (TDD) approach. The key advantage is that you can formalize what you expect through your tests. Ideally, I like the consuming and producing teams to sit together while defining the tests, but after the tests have been defined, the need for communication between the teams decreases significantly.

    Use formal contracts where bandwidth of communication is low

    When being distributed there are cases where you just can't communicate as smoothly as suggested above. The reasons may be many. A few that we come across are:
    • Teams in different time-zones with little or no overlapping hours. The prime example of this is of course outsourcing.
    • Specialist teams brought in for their individual expertise. The most common scenario for us at SciSpike is that we work with a user interface company responsible for the front-end experience (UIX) while our team builds the back-end services.
    • Integration between two software components built using different processes/technologies.
    Because we run across these scenarios quite frequently, SciSpike developed a set of languages that allows us to specify the integration points for all the above scenarios (and a few more). The primary focus of these languages is to formalize a contract between the integrating parties that is unambiguous and verifiable. From these languages we sometimes generate code that either implements the integration or tests the correctness of the integration on either side.

    Even if you don't have these languages (we've been thinking of open-sourcing some of them, and if we do, I'll blog about it), you can come a long way using more traditional techniques:
    1. Partition the project into components and distribute the implementation to the various partners in a way that minimizes the need for communication. 
    2. Formalize the interface by defining test suites that clearly define what it means for either partner to be correct (using TDD if you'd like).
    In some cases I've seen attempts at shifting hours when working across time zones. I've not seen much success with this. All it seems to do is upset the developers, and even when you do shift the hours, you often only get a few hours of overlap (e.g., US-India or US-China).

    Another anti-pattern here is to introduce informal communication channels (e.g., email) about project statuses and the like. First, I think project status should be read out of your agile tracking tool, the test results, etc. Also, if you really need to synchronize further, build some precision into your language. I mentioned above that SciSpike has developed a set of specification languages for this purpose. If that's not an option for you, there are some higher-level languages that can be used for communication. UML is an obvious one (although I've not found a lot of good use for it in this respect; it is not precise enough to specify the semantics of the software and it is too cumbersome to produce). Interfaces with good documentation can go a long way too. In the .NET space you may want to use something like Spec#.

    Bring everyone together for the sprint planning

    I like to bring the team together for the sprint planning. I prefer 4-week sprints (never any longer). It is possible to do sprint planning remotely too, but I prefer to be co-located during sprint planning. Bringing the team together in one location once a month is not all that expensive, and it smooths out the online collaboration. I usually use the last day (or at least half of the last day) to work out the architecture and design and ensure that we have an initial plan for the sprint. In case we use a contract language or TDD to specify components (see the discussion of contracts and TDD above), I'll try to formalize these in the sprint planning as well.

    Working software over comprehensive documentation

    This manifesto item is really a reaction to common bureaucratic processes (often introduced by some kind of standards organization or maturity model like ISO or CMM). We've probably all been exposed to projects where we knew we were working on documents that would never be read, but that the process required us to produce in order to be in compliance.

    I don't think distribution provides much of a challenge here (that is, the principles below are good practices for a co-located team as well), so I'll just list the practices with no further discussion. Notice that I'm using cloud-based services for the most part; however, if you have a nice corporate cloud, that would of course also work. We often set up Amazon-based machines and install our own build and development services, but there are plenty of hosted cloud-based offerings available.
    • Cloud-based source code repository (e.g. GitHub, centralized SVN).
    • Automatic cloud-based build services that check out and build the software immediately. This service should also publish the status of the build.
    • Cloud-based component management that allows the developers to 'push' and 'pull' components from some centralized repository.
    • Processes that 'prove' that the software works. Perhaps you have a Q/A person or the product manager sign off on a story after a demo.
    If I were to summarize the above points in a single statement, it would be: "Keep all your software in a place where all locations have equal access, and ensure that the software is proven to work by some automated process."

    Customer collaboration over contract negotiation

    The challenge here is to get customers involved in a distributed process. Here again we rely on online collaboration tools such as Sococo. We insist that our customers partake in the virtual office and that they keep regular hours connected. It then becomes very easy to set up an ad hoc meeting with our customers to iron out issues.

    We also encourage our customers to partake in the daily scrum and when possible keep them engaged in demos and tests.

    Another crucial element is to ensure that progress is tracked in real time. We teach our customers to read our agile tracking tool and ensure that they are engaged as frequently as their schedules allow.
    If our customer is external, we also highly recommend that they get involved in the project in the roles of Product Owner and Scrum Master. This ensures that the customer is always fully aware of where we are in the process.

    One may argue that the above points are the same whether you are distributed or co-located. I would agree, but I have to say we spend more time teaching our customers and making sure that they fully understand the agile reporting and how to read progress (e.g., burn-down charts).

    Responding to change over following a plan

    You have to set up a process for handling change (it doesn't matter whether your team is distributed or not). I have not found any particular challenge with being distributed with respect to this manifesto item. However, I've seen many projects spend too much time trying to freeze requirements because they are distributed. I don't see this as necessary if you follow the advice presented above.

    We do make sure that we track the progress of development in real time. We've tried out many different tools to do this. They all have their advantages and disadvantages, but it is critical that everyone on the project has access. Tools like Jira (with its agile extension), PivotalTracker, Rally, etc. work well.

    Conclusion

    In this blog article I've described some of the techniques we use at SciSpike to follow agile practices even when working in distributed teams. We find that you can work effectively when distributed and still follow the agile practices. There are cases where some formalization of contracts is required, but for the most part the trick is to enable online communication and ensure that the participants use the available tools effectively.

