Twitter is great for making friends and sharing links, but researchers are also increasingly using it to study human interactions. This is more difficult than it sounds, since privacy settings and caps on server access can make it hard to gather research data from social networking sites.
The Social Network Write Generator (SONG) generates data that closely replicates the behaviour of genuine tweeters. To build it, the team gathered 12 million tweets written by 2.4 million people between November 25 and December 4 2008. They cut out the 75% of users who didn't send a single tweet during the 19-day period and filtered for spammers by looking for accounts with a high tweet-to-followers ratio, leaving them with a dataset of around 350,000 users.
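The two filtering steps described above can be sketched in a few lines of Python. This is only an illustration, not the team's code: the `User` record, the `filter_users` function and the cutoff value are all assumptions, since the article does not say what ratio counted as "high".

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    tweets: int      # tweets sent during the collection window
    followers: int


# Hypothetical cutoff: the article does not give the ratio the team used.
MAX_TWEETS_PER_FOLLOWER = 50.0


def filter_users(users, max_ratio=MAX_TWEETS_PER_FOLLOWER):
    """Drop inactive accounts, then drop likely spammers."""
    kept = []
    for u in users:
        if u.tweets == 0:
            continue  # sent nothing during the window
        # Treat zero followers as one to avoid dividing by zero (an assumption).
        ratio = u.tweets / max(u.followers, 1)
        if ratio > max_ratio:
            continue  # high tweet-to-followers ratio: likely spam
        kept.append(u)
    return kept


sample = [
    User("lurker", 0, 120),
    User("regular", 40, 200),
    User("spammer", 5000, 3),
]
print([u.name for u in filter_users(sample)])  # prints ['regular']
```

In this toy sample, only the account that tweets at a modest rate relative to its audience survives both filters.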
Analysing these users revealed a number of properties that the team replicated in SONG. They found that general tweeting levels build up during the day then die down at night, but also fluctuate in a predictable way from second to second and hour to hour.
The researchers also discovered that both the time between an individual's tweets and the variation between prolific tweeters and lurkers (non-active Twitter users) follow a standard mathematical model called the log-normal distribution.
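To give a feel for what a log-normal model means in practice, here is a minimal Python sketch that draws per-user activity levels and the gaps between one user's tweets from log-normal distributions. It is not SONG itself: the function names and every parameter value (`mu`, `sigma`, the one-minute median gap) are made up for illustration, and the sketch ignores the day and night cycle.

```python
import math
import random


def user_activity_levels(n_users, mu=0.0, sigma=1.5, rng=random):
    """Relative activity per user, log-normally distributed.

    Most users land near the low end (lurkers); a few are very prolific.
    """
    return [rng.lognormvariate(mu, sigma) for _ in range(n_users)]


def tweet_times(duration_s, mu, sigma, rng=random):
    """Timestamps in seconds whose gaps follow a log-normal distribution."""
    times = []
    t = rng.lognormvariate(mu, sigma)
    while t < duration_s:
        times.append(t)
        t += rng.lognormvariate(mu, sigma)
    return times


rng = random.Random(42)  # fixed seed so runs are repeatable
levels = user_activity_levels(5, rng=rng)
# Assumed values: a median gap of 60 seconds, simulated over one hour.
times = tweet_times(3600, mu=math.log(60), sigma=1.0, rng=rng)
print(len(levels), "users;", len(times), "tweets from one user in an hour")
```

Because `mu` is the mean of the logarithm of the gap, `math.log(60)` sets the median gap to 60 seconds; raising `sigma` makes long silences and rapid bursts both more common.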
Plugging these findings into SONG let Erramilli and colleagues run their own version of Twitter on a network of 16 computers. By gradually increasing the number of tweets per second, they discovered that CPU overload caused the network to falter at over 100 tweets per second and totally collapse at around 150. Presumably this means Twitter owns more than 16 computers.
The researchers say that this proof-of-concept shows that SONG can be used to accurately model Twitter, though beefier hardware might be necessary to get close to the real thing. They plan to release the code for SONG soon to let other researchers build their own virtual Twitters and model "what-if" scenarios such as high loads caused by trending topics or a sudden rise in popularity in particular geographical regions.