Dalian

Scaling Out Like Technorati

My fellow World Economic Forum Technology Pioneer, David Sifry, the founder of Technorati, was also in Dalian, China for the “Meeting of New Champions” or “Summer Davos” as the Chinese like to call it. During Davos in January, I had the great misfortune of pitching Alfresco against Technorati in a competition between tech pioneer companies. As fantastically well as Alfresco is doing, Technorati has the temerity to compete against Google in blog search and win.

I got the chance to talk to Dave during the conference and ask him some questions on the technology and architecture behind Technorati, the internet blog search site. I thought that someone who could take ordinary computer components and build a huge internet architecture could possibly teach something to people running enterprise architectures that are puny in comparison.

Technorati is a web site that tracks blogs, pictures and any user generated content and lets you search that content to find out what people are thinking, seeing and hearing. When a new or urgent situation breaks out, you can do worse than to search Technorati for immediate reaction. Every day, every hour, every second, Technorati is indexing over 100 million blogs with over 10 billion objects. Technorati’s user base is doubling every six months, and quick, accurate responses are critical for retaining those users.

David Sifry, Founder and Chairman of Technorati

I asked Dave about his architecture and what applicability there might be for enterprise architectures.

John Newton: In building Technorati, what were some of the issues that you had in architecting your systems?

David Sifry: I was looking at just temporal information. I had no idea how big it could get. When I looked at the architecture, instead of architecting it right, I architected it for right now. I had no big budget and I didn’t want to wait six months to build it. Also, I had no idea what the killer app would be.

I focused on data flexibility. At the time, that meant putting everything into a relational database. That was okay while the size of the indexes was less than RAM and about a million blocks of data. That was okay while there were fewer than 20 million blogs.

The next generation took advantage of data parallelism. That meant that, upon an update, we sent a signal to all the other systems. We spread the data over several “shards” [segments of data partitioned on different databases on separate machines].
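To make the idea of URL-based shards concrete, here is a minimal sketch, assuming a hash of the URL picks the shard. The shard count, the `shard_for_url` function and the `ShardedStore` class are hypothetical names for illustration; the interview does not say how Technorati actually assigned data to shards.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical; the interview does not give a shard count


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard number with a stable hash (an assumed scheme)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy stand-in for several databases on separate machines."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        # Each dict plays the role of one independent database.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, url: str, record: dict) -> None:
        self.shards[shard_for_url(url, len(self.shards))][url] = record

    def get(self, url: str):
        return self.shards[shard_for_url(url, len(self.shards))].get(url)


store = ShardedStore()
store.put("http://example.com/post-1", {"link_count": 3})
print(store.get("http://example.com/post-1"))
```

Because every read and write for a given URL lands on exactly one shard, adding machines adds capacity without any single database having to hold everything.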

What was surprising was that we were writing as much data as we were reading. At this point Technorati was as big as some of the biggest OLTP systems. Even so, maintaining data integrity was important, because you wouldn’t want the link count [count of how many other blogs point to a particular URL] to be out of sync. This put real pressure on the system. At the same time, we realized that time was a more important dimension than URL. People didn’t want to sort or search on URL, they wanted to search on time. [i.e. what are the latest blogs on a particular subject?]

By this point, we understood the application more and more. The app [Technorati] is about real time access. You need to be able to count on finding the latest information on a subject. That’s when we built the third architecture. Scaling was well understood and we built the shards on time rather than on URLs. Instead of putting data into a DBMS, we put it into special purpose databases. It was more of a bus-based architecture. Each database could be scalable and grow as big as we needed.
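Sharding on time rather than on URL can be sketched as follows. This is an illustration under assumptions, not Technorati's implementation: the one-shard-per-day granularity, the `TimeShardedIndex` class and the naive substring match are all hypothetical choices.

```python
from collections import defaultdict
from datetime import datetime


class TimeShardedIndex:
    """Toy index with one shard per calendar day (an assumed granularity)."""

    def __init__(self):
        # Maps a date to the list of (timestamp, url, text) posted that day.
        self.shards = defaultdict(list)

    def add(self, posted_at: datetime, url: str, text: str) -> None:
        self.shards[posted_at.date()].append((posted_at, url, text))

    def latest(self, term: str, limit: int = 10) -> list:
        """Return URLs of the newest posts containing term, newest first."""
        results = []
        # Visit the newest shards first and stop once enough hits are found,
        # so a "what's the latest?" query rarely touches old shards.
        for day in sorted(self.shards, reverse=True):
            hits = [(ts, url) for ts, url, text in self.shards[day]
                    if term.lower() in text.lower()]
            hits.sort(reverse=True)
            results.extend(hits)
            if len(results) >= limit:
                break
        return [url for _, url in results[:limit]]


index = TimeShardedIndex()
index.add(datetime(2007, 9, 6, 9, 0), "http://example.com/a", "Summer Davos opens")
index.add(datetime(2007, 9, 7, 9, 0), "http://example.com/b", "Davos day two")
index.add(datetime(2007, 9, 8, 9, 0), "http://example.com/c", "Notes from Davos")
print(index.latest("davos", limit=2))
```

The design point is that the most common query, the latest posts on a subject, is answered from the most recent shards, while older shards can sit on separate machines and are only consulted when the recent ones do not yield enough results.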

JN: The notion of shards - did you call it that at the time? I have been looking into shards and I was only aware of or heard of them for about the last year.

DS: Back in 2002 when we were pitching this to VCs, I at least explained the theory. I just thought through the problem carefully. Doing it this way, we could add hundreds of systems, lots of cheap CPUs, RAM and disks. It provides inherent parallelism. I can’t believe that I was the first one to think this up.

JN: How big does this architecture scale?

DS: We are loading one terabyte a day into Technorati. That’s 100 million blogs or about 10 billion objects. A lot of it is new types of tagged data. There are about a half billion videos and photos.

With all that data, you have to think about what you throw away. We can’t really delete anything, because we would potentially be losing an asset. We don’t delete anything. So we take data out of the spin cycle. [Transitory data used in preparation.] We take the long-term data and put it into low latency storage.

When data is doubling in size every six months, that means that only one quarter of it is a year old or more. We don’t need to worry about old data.
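The arithmetic behind that one-quarter figure can be checked with a short sketch. It assumes steady exponential growth and that nothing is ever deleted; the function name and parameters are hypothetical.

```python
def fraction_older_than(age_months: float, doubling_months: float = 6.0) -> float:
    """Share of today's data that already existed age_months ago.

    Assumes the total doubles every doubling_months and nothing is deleted,
    so the total age_months ago was today's total halved once per doubling.
    """
    return 0.5 ** (age_months / doubling_months)


print(fraction_older_than(12))  # two doublings in a year: 0.25
print(fraction_older_than(6))   # one doubling: 0.5
```

Under the same assumptions, three quarters of the data at any moment is less than a year old, which is why attention goes to the recent data.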

JN: How do you deal with large number of users with very large data sets?

DS: Any off-the-shelf tool falls over. There is a lot of interesting analysis to do on old data, but no off-the-shelf tools can handle that much data. It’s only just now that some tools can handle it.

JN: What are those tools?

DS: One is Greenplum by a bunch of O’Reilly guys. If you use ordinary data warehouse tools, they would just scream and shout.

JN: Actually what I was originally referring to was the fact that you are showing lots of data to users who are not used to enterprise information management tools. How do you present this information to consumer-level users? How do you deal with the user interface and visualization of all this data?

DS: Gotcha. It depends on what the user wants to get out of Technorati. If the user wants search results, then we give them search results. Sometimes they want to browse or discover information. We have spent a lot of time on visual design. Then we give them lots of bright, shiny things to click on. Things like metadata, video or other links.

We have used enterprise class web tools to analyze what users are doing. We look at the click stream and see what is successful or not. That helps to make the information contextual.

One of the big mistakes that we made was not doing this [buy click stream analysis tools] sooner. It was only $80K. Up to that point it was so much trial and error. I’m glad we finally did it. Now we can see how much time a user spends on a feature. We can see page views, goals per visitor.

JN: So what do you measure on Technorati?

DS: Measuring a web site is like forecasting the weather. Yesterday it was sunny and today it is cloudy. Why is it cloudy? Sometimes you have no idea. Sometimes you realize that a change in barometric pressure has a lot to do with it.

We look at the number of newbies, number of reports, session lengths and then measure them against prior periods. It’s not always consistent.

I had never built a B2C site before. I just focused on me, on what I wanted. That worked well for a while when I was the target audience. But we have to build for a broader audience.

JN: At Alfresco, we measure conversions. Are you measuring things like performance? Does that affect retention of users?

DS: Of course, but if the system is falling down, then even performance doesn’t matter. So I don’t get too stressed out about it.

JN: When we met at Davos you wanted to move Technorati to be the Internet Now! Is that still the case?

DS: Everything is shifting. I wanted it to be a site that everyone is able to use. We forgot about the core users that just wanted to find out about blogs and any real time information. In an attempt to jump the chasm, we chased after 100 million users and tried to be everything to everyone. Now we try to make blogs and user driven content available for those looking for that.

Also, performance has improved significantly. Now I notice how slow other sites are. This is a total tribute to the engineering team. Everything is easier and faster.

Pretty soon we will have a whole lot of stuff that we have been working on for a year.

JN: Can you say what it is?

DS: I don’t pre-announce.

JN: What does the Technorati brand stand for today?

DS: Good question. What’s popping up now on the internet, especially user generated content? It’s about users tagging user generated content and finding it.

JN: Who are your competitors?

DS: I probably sound like the typical entrepreneur, but nobody, really, not seriously. Google provides blog search, but other than that nobody really. Other people, like Digg and del.icio.us, are trying to identify and tag information, but they aren’t really competition.

JN: What do you want Technorati to be in two years’ time? Five years would be ridiculous.

DS: I would like Technorati to be a profitable business that is strongly differentiated. It will be the place that you would go for mobile, RSS or push information. For all that you would come to Technorati.

Summer Davos in Dalian, China

Last week I was in Dalian, China for the World Economic Forum Inaugural Meeting of the New Champions. That’s a mouthful, so the Chinese simply called it the “Summer Davos”. It makes sense, as this feels very much like Davos, only a bit smaller, slightly more relaxed and less intimidating. It is still difficult to be really relaxed with so many diverse bright minds, but the scope of topics was more manageable and the number of sessions made it easier to choose. There were still the same types of plenaries, panels, board room discussions and collaborative workshops. The focus was global, but the star of the show was China as the world's manufacturer, major outsource destination, next consumer society, and next world economic power.


Wen Jiabao, Premier of the People's Republic of China, speaking in Dalian. Photo courtesy of the World Economic Forum by Natalie Behring.

Given where we were, the Chinese government made a concerted effort to put on a really good show. Dalian is a city that neither I nor most of the participants I spoke to had heard of before this conference was organized. Few of us expected a large city with tall, new buildings, clean streets and significant infrastructure. While I expected some small, seaside fishing town, what I found was a major industrial port with tourist attractions, resembling a Chinese San Diego. Everything was big, clean and shiny. What I have heard is that Dalian was the point of Japanese invasion during World War II and the Sino-Japanese War of 1895. For good or ill, there has been a long history of connections to Japan that has encouraged investment in this industrial capacity and outsourcing. Much of the city looks like it has been built in the last decade. Companies such as Intel, HP and British Telecom have very large development operations there.

Due to the location and the spectacular growth and potential power of China, 1,500 attendees came to the World Economic Forum event to learn more about China. The majority of the conference agenda focused on the role that China and the other “New Champions”, India, Russia and Brazil, or the so-called BRIC countries, will play in the global economy, including in information technology, outsourcing and innovation. I was particularly interested in software and technology development in these countries. As discussed in numerous sessions, by many measures China is the third or fourth largest economy and is on track to become the second largest soon. During the conference, several people described the United States as the Great Britain of the 21st Century and nobody disagreed. Still, even with such large numbers, that means a per capita GDP of under $5,000 in the coastal areas and $1,000 in the interior, so China is obviously still a developing country.

I was invited to a private lunch that benchmarked venture capital investment between China and India and featured some key venture capitalists like Joe Schoendorf from Accel Partners (who are an investor in Alfresco). One of the presenters, Professor Martin Haemmig from the Center for Technology and Innovation Management, has probably come up with the only analysis of VC investing across the two countries and the US. Martin stated that over the last five years the median return on investment in Chinese technology has been an astounding 25 times, with a top quartile return of 40 times and a lower quartile return of 14 times. This compares with the US, where the median return of venture capital is 6.8 times. Even the bottom return of a Chinese fund exceeds the best return of a US fund. As Martin points out, “No wonder all the VCs are piling into China.” Although China doesn’t really come close to matching the amount of VC investing in the US, it is now number two in the world. There is six times as much VC investing in China as there is in India.

Everyone anticipates continued growth in China, especially as the boom moves westward from the coastal cities. As in Davos, I focused much of my time on the interactive workshops that allow us to engage more directly with the other participants of the conference. I will be writing up what I learned at those sessions, along with interviews with some of the people I met from the IT sectors. I also like politics, so I will write up my views on what I learned just from being in China and on one of the most controversial sessions, featuring Thomas Friedman, the author of “The World is Flat”.
