Basically, the problem that kawalpemilu.org is trying to solve is:
How do you digitize over 470,000 scanned C1 form images so that they can be summarized easily?
The first idea that came to mind was optical character recognition (OCR). However, I could not find a good enough online OCR tool that could handle handwriting recognition. I did find some websites that do handwriting recognition, but I couldn't find a program that could be used readily (usually the program needs some initial training data beforehand). Since the handwriting on the images comes from different people all over Indonesia, I figured that none of the handwriting recognition tools would work well (I'd love to be proven wrong).
The second idea that came to mind was crowdsourcing. This is not a new idea; a number of other crowdsourcing websites were already running (such as realcount.herokuapp.com and http://kawal-suara.appspot.com). The crowdsourcing idea is to build a system that allows users to see an image and manually enter the digits they see into the system. This idea has its own challenges:
- How do we find the users?
- How do we verify the digitized results? (how to make it accountable?)
- How do we make it fast and addictive? (we had less than two weeks to digitize the 470,000+ images)
- How do we make the system scale? (handling thousands of concurrent users? crackers?)
We gathered insights and learned from the existing crowdsourcing systems to address these challenges. At the time, we seemed to have answers to these challenges, so we went ahead with the crowdsourcing idea. Essentially, answering how kawalpemilu.org works is equivalent to answering how it addressed the challenges above. Now, I'm going to elaborate on each challenge in more detail.
How do we find the users?
Learning from the other crowdsourcing systems, it seems easy to find volunteers to help digitize the images. However, allowing arbitrary volunteers to contribute can backfire: if some volunteers are malicious, they can enter garbage values into the system, and it would be very hard to clean up later. There is a technique that uses at least two users to digitize the same image and only accepts the values if both users enter the same value. This technique slows down data entry by at least a factor of two and may not be effective against a collusion attack (i.e., a set of users agrees on a certain value for Jokowi and a certain value for Prabowo).
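For illustration only, the agreement check at the heart of that double-entry technique boils down to something like the sketch below. This is not what kawalpemilu.org does, and the struct and function names are made up:

```cpp
#include <optional>

// One user's reading of the two tallies on a C1 form (illustrative only).
struct Entry {
  int prabowo;
  int jokowi;
};

// Accept the values only when two independent entries agree; otherwise flag
// the image for another pass (e.g., a third user).
std::optional<Entry> Reconcile(const Entry& a, const Entry& b) {
  if (a.prabowo == b.prabowo && a.jokowi == b.jokowi) return a;
  return std::nullopt;
}
```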
Our solution is to leverage trusted volunteers. Quora User recruited a set of people he trusted as volunteers, and in turn those volunteers could recruit other volunteers they trusted, and so on. The number of volunteers grew rapidly, and after just a few days we stopped recruitment when we reached 700 volunteers. At that point the digitization process was almost done (around 97%):
How do we verify the digitized results?
Via another crowdsourcing system :). Yes, we built two systems: an internal system (private) and an external system (public). The internal system is only accessible to the trusted volunteers. The external system is publicly accessible, so anyone around the globe can verify the digitized results entered by our trusted volunteers and report any errors back to us.
The trusted volunteers need to sign in with their Facebook account to use the internal system. Each volunteer's Facebook account is associated with the images they digitized. This was designed on purpose so that malicious volunteers would think twice before trying to mess with the internal system. Moreover, each image can be digitized at most once, so a malicious user cannot edit the values of images that were already digitized; edit privileges are only given to "super trusted" volunteers. We received around 4,000 error reports and fixed them (sometimes they involved broken images, which we reported to KPU). So far, the error reports we received were due to honest mistakes. I'm really grateful to have gotten to know the trusted volunteer community, who are scattered around the globe:
How do we make it fast and addictive?
The existing crowdsourcing sites I tried use randomization to determine the next image to digitize. One advantage of this approach is that it makes it hard for malicious users to collude. On the flip side, I think it demotivates honest users, since they cannot control which region they want to help digitize. Moreover, it does not give the users any reward (i.e., the users cannot see the values they have just entered). To motivate users, one could create a ranking of the users who digitized the most images. However, this may backfire (i.e., users may rush and make more digitizing errors to climb the ranking).
Since I didn't have enough time to create a game (i.e., to gamify the digitization process), I thought of a simpler way to reward the users and give them the freedom to choose which region to digitize. I figured, what about showing the users the values they just entered? That seemed rewarding enough. For example, after users have done the hard work of digitizing the images from a particular village, they can roll up (drill up) to see who wins in that village. This approach also lets the users pick which region they want to digitize and makes it easy for the volunteer organizers to distribute the work.
To make the internal system fast, we use fancy dynamic HTML to auto-advance to the next image when the user hits the submit button for the current image. This allows a user to digitize an image in only 5 seconds. We also crop the images near the numbers and send the users a smaller image (i.e., 10 KB instead of 150 KB) to speed up page loading. We built a custom in-memory database in C++ from scratch to ensure low latency for navigation and digitization submissions. Below is the latency distribution. Most requests are completed in less than 1 millisecond.
I was lucky that at that time I was in Singapore, where I had SingTel Fibre Broadband installed at home. I could download the (70 GB of) images from the KPU website quickly to make the smaller cropped images. Otherwise, I would have had to hot-link the images from the KPU website, which was sometimes slow and sometimes unavailable, and that could have slowed down the digitization process.
How do we make the system scale?
Once the external system was made public (kawalpemilu.org) and covered by media such as MetroTV and detik.com, the number of page views increased rapidly. According to Google Analytics, over the two-week period (July 13-27), we received close to 3 million page views:
At peak, we had around 5,000 concurrent users (in the real-time overview). We also received DDoS attacks in which the external system was hammered with 9,000 requests per second.
We used Google App Engine (GAE) for the external system. The external system fetches and caches the data from the internal system periodically (e.g., every 10 minutes for recently accessed pages). Thus, no matter how many requests the external system receives, the internal system is unaffected (i.e., the internal system only receives a constant, low number of requests per second). We rely on GAE for scalability and security. That is, as long as Andrian Kurniady keeps a valid credit card on the GAE account, we can rest assured that the external system will be safe from attackers and stay available :).
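The real external system runs on GAE; the C++ sketch below only illustrates the time-based caching pattern, with the placeholder FetchFromInternal() standing in for the periodic fetch from the internal system:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;
constexpr auto kTtl = std::chrono::minutes(10);  // refresh window for recently accessed pages

struct CacheEntry {
  std::string json;              // cached payload served to the public
  Clock::time_point fetched_at;  // when it was last fetched from the internal system
};

std::unordered_map<std::string, CacheEntry> cache;

// Stand-in for the fetch from the internal system (in the real setup this
// would be an HTTP call from the external system to the internal backend).
std::string FetchFromInternal(const std::string& page) {
  return "{\"page\":\"" + page + "\"}";  // placeholder payload
}

// Public requests are always served from the cache; the internal system is
// contacted only when an entry is missing or stale, so spikes in public
// traffic never reach it.
std::string ServePage(const std::string& page) {
  auto it = cache.find(page);
  if (it == cache.end() || Clock::now() - it->second.fetched_at > kTtl) {
    CacheEntry entry{FetchFromInternal(page), Clock::now()};
    it = cache.insert_or_assign(page, entry).first;
  }
  return it->second.json;
}
```

Any cache with a similar expiry gives the same effect: however large the public traffic gets, the internal system only sees a slow, bounded trickle of fetches.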
UPDATE Jan 10, 2015:
How does the in-memory database work for the internal website?
The in-memory database accepts 18 operations via HTTP requests:
https://github.com/kawalpemilu/kawalpemilu2014/blob/master/internal-backend/api.cc#L1092-L1109
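As a rough illustration of what such an API looks like (the operation names and bodies below are placeholders, not the actual ones in api.cc), each operation can be thought of as an entry in a dispatch table:

```cpp
#include <functional>
#include <map>
#include <string>

// Query parameters parsed from the HTTP request.
using Params  = std::map<std::string, std::string>;
using Handler = std::function<std::string(const Params&)>;

// Dispatch table mapping an operation name to its handler. The names and
// bodies here are made up for illustration.
std::map<std::string, Handler> kHandlers = {
  {"children", [](const Params&) { return std::string("[...]"); }},  // list an area's sub-areas
  {"submit",   [](const Params&) { return std::string("ok");    }},  // record digitized tallies
  {"report",   [](const Params&) { return std::string("ok");    }},  // file an error report
};

std::string Dispatch(const std::string& op, const Params& params) {
  auto it = kHandlers.find(op);
  return it == kHandlers.end() ? std::string("unknown operation") : it->second(params);
}
```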
The main data structure for administrative area navigation is a tree (keyed by KPU's internal ids from their database):
https://github.com/kawalpemilu/kawalpemilu2014/blob/master/internal-backend/load.h#L110
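Here is a simplified sketch of the idea (the field and function names are mine, not the actual definitions in load.h): each administrative area is a node keyed by its KPU id, and intermediate nodes keep the aggregated tallies of their whole subtree.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// One node per administrative area (province > regency > district > village >
// TPS). Intermediate nodes carry the aggregated tallies of their subtree.
struct Node {
  int64_t kpu_id = 0;
  Node* parent = nullptr;        // nullptr at the national root
  std::vector<Node*> children;
  int64_t prabowo = 0;           // subtree aggregates
  int64_t jokowi = 0;
};

std::unordered_map<int64_t, Node> nodes;  // O(1) navigation by KPU id

// Recording one digitized C1 form propagates the tallies up the parent chain.
// The hierarchy is only a few levels deep, so this is effectively O(1).
void AddTally(int64_t tps_id, int64_t prabowo, int64_t jokowi) {
  for (Node* n = &nodes.at(tps_id); n != nullptr; n = n->parent) {
    n->prabowo += prabowo;
    n->jokowi += jokowi;
  }
}

// Summarizing any area (province, regency, ...) is a single lookup, with no
// "GROUP BY" scan over all rows.
std::pair<int64_t, int64_t> Summarize(int64_t area_id) {
  const Node& n = nodes.at(area_id);
  return {n.prabowo, n.jokowi};
}
```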
Advantages of the in-memory database:
- It is simple. If you can solve this INC-H problem, then you are qualified to write an in-memory database yourself :). 50 Indonesian teams who participated in the Indonesia National Contest 2014 solved it, so it's not hard. This in-memory database can be coded in less than an hour!
- It is single threaded. Since I'm using libuv for the HTTP server, all requests are serialized. Thus, I don't need to think about data races, synchronization, etc. This gives the Isolation property in ACID.
- High performance. It loads all 470K+ nodes into the tree data structure in under 2 seconds. Each query is answered in at most 4 ms (mostly around 32 microseconds). The tree maintains the aggregated results in the intermediate nodes, so each update request is done in O(1) (the hierarchy is only a few levels deep) and each summarization query is answered in O(1) as well. Without such a tree (which normal databases don't maintain), you would need to run an expensive "GROUP BY" query that scans all the rows in the database.
Disadvantages of the in-memory database:
- Not fully durable. I am using very crude mmap (memory-mapped) files to persist the in-memory data to disk, so if the server crashes or shuts down, the data is saved on disk and can be reloaded (a minimal version of this idea is sketched after this list). The last few updates before a crash might not yet have been saved to disk. Luckily, during the heavy load in the crucial first week, the server didn't crash :D. This is a trade-off I was willing to make. The HTTP server had been well tested over the years running the uHunt API.
- Not atomic. If an update request changes several values in the database and the server crashes midway, some of the changes may have succeeded while the rest have not. To minimize the severity of this, I minimized the things that need to be stored on disk: I don't store the aggregated values of the tree; I recompute them upon reload. The only things persisted to disk are the TPS records and the error reports.
- Not consistent. Obviously :). It doesn't have a rollback mechanism in case an update query fails midway.
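Here is a minimal sketch of the crude mmap persistence mentioned in the first disadvantage (the record layout, constants, and names are illustrative, not the actual ones):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>

// One fixed-size record per polling station (TPS); illustrative layout only.
struct TpsRecord {
  long long tps_id;
  long long prabowo;
  long long jokowi;
};

constexpr std::size_t kMaxRecords = 500000;
constexpr std::size_t kFileSize = kMaxRecords * sizeof(TpsRecord);

// Map a file into memory so that plain struct writes are persisted by the OS.
// Pages not yet flushed by the kernel are lost on a crash -- the "not fully
// durable" trade-off described above.
TpsRecord* OpenStore(const char* path) {
  int fd = open(path, O_RDWR | O_CREAT, 0644);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, kFileSize) != 0) { close(fd); return nullptr; }
  void* mem = mmap(nullptr, kFileSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  // the mapping remains valid after the descriptor is closed
  return mem == MAP_FAILED ? nullptr : static_cast<TpsRecord*>(mem);
}

// Saving a record is just an in-memory assignment; msync() could force a
// flush at the cost of latency. Aggregates are not stored here -- they are
// recomputed from these records when the server reloads.
void SaveRecord(TpsRecord* store, std::size_t slot, const TpsRecord& rec) {
  store[slot] = rec;
}
```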
As you can see, there are a lot of severe disadvantages, yet I still used the in-memory database because I needed the performance. The internal website is very responsive because of this, and it becomes "addictive". There is research saying that if the response time is slow, the user will "context switch" to other things. With a <4 ms response time, the volunteers don't have time to context switch, and thus stay focused on data entry :), i.e., addicted.
That's all about how kawalpemilu.org works. You can still ask for more details in the comments below and I will update the text above.