The next generation web, scaling and data mining will matter

We are all enjoying the benefits that come with the commoditization of existing hardware and software infrastructure.  It costs dramatically less to launch a business today than it did five years ago: we are all smarter, broadband penetration is reaching critical mass, and open source and commodity hardware have become reliable alternatives to proprietary architectures and closed systems.  Yet as we all move forward with our web-based operations, it is clear that scaling the back-end infrastructure remains a formidable challenge.  There have been many instances of popular services going down - remember Typepad, Salesforce.com, and del.icio.us, to name a few.

With scaling the back end also comes a need to learn more about your users and their interactions.  Data mining and analysis is becoming a big thing, not only to help companies create better services but also to generate more revenue per user.  In addition, for many web companies, data-driven applications are the core of their services.  Think about Zillow, Technorati, and services like Indeed, which are dynamically driven by aggregating, crawling, and filtering millions of pieces of data.  However, the fast growth of many web-based operations, combined with the need to mine the data, leaves a big hole in the revolution of the cheap.  Web-based operations need an open source, cheaper way to scale their database needs, move to a data warehousing architecture without breaking the bank, and grow with their users on commodity infrastructure.

Enter Greenplum (full disclosure: Greenplum is a portfolio company and I am on the board), which just released its GA product, Bizgres MPP, for data warehousing, leveraging the best of the open source PostgreSQL database.  We have been working on the code for the past 18 months, and I am quite proud of the team for having delivered the release.
Greenplum is taking the best of the open source database PostgreSQL and rebuilding core functions such as query optimization, execution, and the interconnect.  We are allowing anyone to build a shared-nothing architecture a la Google and scale their back end to multi-terabyte systems on cheap hardware.  It is free to run on a single machine, but if you want to run the massively parallel option we charge a fee per CPU.
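Greenplum's actual engine is far more involved, but the shared-nothing idea itself is simple: every node owns a private slice of the data, a query fans out to all nodes in parallel, and a coordinator merges the partial results. Here is a toy Python sketch of that pattern (the class names, hashing scheme, and clickstream example are mine for illustration, not Greenplum's implementation):

```python
# Toy sketch of shared-nothing data distribution (illustration only --
# not Greenplum's actual implementation). Rows are hashed to nodes by a
# distribution key; a query scans every node and merges the results.

import hashlib


class Node:
    """One segment server: owns a private slice of the table."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def scan(self, predicate):
        # Each node scans only its own rows (in parallel, in a real system).
        return [r for r in self.rows if predicate(r)]


class Cluster:
    """Coordinator that hashes rows to nodes and gather-merges queries."""

    def __init__(self, num_nodes):
        self.nodes = [Node() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the distribution key to pick the owning node.
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def insert(self, row, key):
        self._node_for(key).insert(row)

    def query(self, predicate):
        # Fan out to all nodes, then union the partial results.
        results = []
        for node in self.nodes:
            results.extend(node.scan(predicate))
        return results


# Spread a fake clickstream table across 4 nodes, keyed by user_id.
cluster = Cluster(num_nodes=4)
for user_id in range(1000):
    cluster.insert({"user_id": user_id, "clicks": user_id % 7}, key=user_id)

# Find heavy clickers: each node answers for its own slice.
heavy = cluster.query(lambda r: r["clicks"] >= 5)
```

Adding capacity in this model means adding nodes, since no shared disk or shared memory sits in the critical path - which is why the approach maps so naturally onto racks of commodity hardware.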

Dana Blankenhorn from ZDNet gets it:

This is a problem a lot of Web 2.0 start-ups like Technorati, Bloglines and Flickr are facing, and projects like Drupal will face soon. They were built with open source tools, but then find they need to "graduate" to something like a data warehouse.  And there's old Oracle, telling them there's nothing from an open source supplier that can deliver what they need. Share with us, they say, you don't have any choice.

Well, now there is a choice. Greenplum CTO Luke Lonergan said that O'Reilly Media, one of Greenplum's early customers, graduated from MySQL to PostgreSQL with Greenplum and got a 100x improvement in database access speed across a 500 gigabyte database. Other Web 2.0 start-ups, and projects, can do the same thing.

"The price of conversion is where the pain is," said Greenplum's Scott Yara, "but look at how fast some of these projects grow."  While MySQL was smart in building on a lightweight Web base, more and more users and projects will find the need to graduate, and face proprietary FUD from major vendors saying they have to pay the "monopoly tax" in order to grow.

I truly believe the next battleground will be scaling the back end and, more importantly, mining all of that clickstream data to offer users a better service.  Those that can do it cheaply and effectively will win.  The tools are getting more sophisticated, the data sizes are growing exponentially, and companies don't want to break the bank or wait for Godot to see results.  Given these trends, I suggest downloading Greenplum's Bizgres MPP; let me know what you think.


Post Author

This post was written by the author, who has written 358 posts on BeyondVC.

12 Responses to “The next generation web, scaling and data mining will matter”

  1. Dainel Nerezov Mar 3, 2006 at 1:05 am #

    Okkkkey…yes, the problem discussed here is real.

    We are scared big time that increased traffic will bring down our hosted software. And we don’t know how to solve this problem.

    I’m looking at Greenplum.

  3. Cem Dalgic Mar 3, 2006 at 10:53 am #

    More than this. The next step will be artificially intelligent search and data mining algorithms. Hello to Israel ;)

  4. Vladimir Miloushev Mar 3, 2006 at 6:07 pm #

    The problem is there, it is bad, and it will only get worse. It is not limited to Web 2.0 startups, either – in the last 9 months, I have met with at least a dozen Fortune 1000 CIOs, and with a single exception, all of them have the same problem in-house. And, for each of them it is a $100M+ per year problem.

    I wish scaling the back-end could be achieved by simply beefing up the database engine. Unfortunately, in real-world applications things are not that simple. To provide rich user experience, one needs to keep increasing amounts of state at all tiers of the application. The new AJAX interfaces are “chatty” and require frequent interactions with the back end. User expectations are moving closer and closer to “real time” – we click on an image or a link (or, sometimes, just hover the cursor over it), and we expect something to happen right away. The only way to meet these expectations is to scale the back end infrastructure as a whole.

    Conventional wisdom says that the way to scale the back end is by buying “enterprise” gear and software, e.g. Oracle. Yet, despite the high prices of this infrastructure, there is no evidence that it is even capable of scaling to the size of Google or Yahoo.

    The only technology we have today that is proven in real life to scale from an old x86 box in the corner to 100,000+ servers in a single application (Google) is open source software running on commodity hardware.

    The real question is, how do we make this technology easy enough to scale that every Web 2.0 startup can do it without having to spend (and raise) millions of dollars on IT infrastructure and consultants.

  5. Igor Mar 6, 2006 at 10:16 am #

    I’ll believe in 100 times better performance when I see all the details about hardware, MySQL configuration, schema and queries used in both cases etc. Unless their MySQL DBA is REALLY clueless and hardware is comparable, this 100x speed advantage seems highly dubious.

  6. Ian Holsman Mar 7, 2006 at 1:03 am #

    What is the difference between the commercial and open source offerings Bizgres provides?

  7. Charlie Crystle Mar 8, 2006 at 11:38 pm #

    Finally a tech post I can relate to. Not a lot of people really understand how to scale server applications. So we get a lot of little web 2.0 features that, when popularized, start to grind to a halt and make it not so neat after all. And most server software efforts consider “scaling” to be adding more boxes and distributing load across them. But that’s not scaling, that’s renting more space. Horizontal scaling.

    Vertical scalability should be the goal–add resources to a server and get a linear increase in performance. Add processors, performance rises proportionally (not perfectly, but close).

    So, yes, many people can hack together a few features and call it baked, but those are just trinkets. Once the masses start calling, you better have an idea of how the smallest of things can bring down your neato AJAX UI or patchwork mashup.

  8. Jonathan Lambert Mar 9, 2006 at 7:16 pm #

    Forget Ajax, the JMS backend we’re working with right now generates insane query load, which you can scale with parallelism. Ajax can be optimized with disk caching technologies (see ibrix.com for example) and smart software architecture. Backend technologies are chatty. Server and security logging is chatty beyond belief!

    It does get to be expensive to do a big parallel system.

    Anything to optimize the number of units deployed has value in my architecture. +1 to this idea, but it’s definitely only part of the solution.

  9. Kevin Mar 13, 2006 at 8:49 pm #

    Interesting post. But I wonder how Postgres became OSS in the first place and never became as successful as Oracle or DB2?

    Ingres, Postgres and MySQL all have some features that are better than the others. Many tools from the OSS community can help remedy what Fred suggested.

    Oh, btw – did IBM also decide to open-source DB2?

  10. Nick Apr 3, 2006 at 11:13 pm #

    Just letting you know this article has made it to the most popular page on VCNewsCentral at http://www.vcnewscentral.com. Well done!

    VCNewsCentral.com is a brand new blog & news aggregator, just like digg and reddit, but specifically for VC and Startup news.
    VCNewsCentral lists postings from all the leading bloggers in the VC industry and then allows anyone to vote on the most interesting postings.

    Check out the VCNewsCentral site and let everyone know what is interesting to you!

    VCNewsCentral – VC, Startup and Entrepreneur news aggregator at http://www.vcnewscentral.com

  11. Ravi Jun 4, 2006 at 9:57 pm #

    The next wave of data mining needs to address the chasm between structured data (numeric) and unstructured data (text).

Trackbacks/Pingbacks

  1. Tech-Confidential - Mar 6, 2006

    Too many fires?

    When it comes to database, Oracle Corp. is an aggressive leader, at least when it comes to squashing the competition. Larry Ellison’s aggressive acquisitions have brought customers to the software giant, eliminated competition and broadened its market …
