How is complexity science applied to business problems? Network Theory

Network theory can be used very successfully to analyse complexity in business. It can be fun, and the results are easy to visualize.

I personally understood why businesses can become more complex, yet more successful, by analysing the evolution of the network of one of the most successful European brands of the past 25 years – the Magnum ice cream. Network theory today is extremely advanced, and plenty of tools are available (see my pictures of the “Magnum Network” below).

The Complexity of the Magnum Ice Cream

Italians love gelato.  On hot summer afternoons, cities fill up with families strolling  around, each member with a gelato in hand. I have always been part of  this collective passion: when I was a teenager, my friends and I  preferred to meet outside gelaterias and indulge in huge ice cream  cones, rather than get drunk in bars.

In  a country where the gelato is sold soon after preparation, and where  teenagers prefer it to alcoholic beverages, you may wonder how packaged  ice creams can possibly sell at all. But they do, and account for 30 percent of the Italian ice cream market thanks to heavy marketing and widespread distribution. Of the 30 percent, half is in the hands of Anglo-Dutch multinational Unilever, under the brand Algida.

For decades, Algida’s strongest seller was the Cornetto, an imitation of the artisanal ice cream cone, launched in 1959, the same year Fellini shot La Dolce Vita. Thirty years later, the country was miles away from the economic boom described by Fellini, but the Cornetto hung around and was joined by a considerable number of competitors.

The  1980s  were a time of consumerist excess, with brands offering cherry,  amaretto, chocolate and biscuit all in the same ice cream. It was in  this environment that Unilever, in 1989, launched Magnum, the simplest ice cream bar ever –vanilla with a chocolate coating.

As an apocryphal quotation attributed to Albert Einstein goes, “things should be as simple as possible, but not simpler”. The Magnum was simple, but not straightforward. Most ice creams had a vanilla filling, but only a few of them used good-quality vanilla, and none was covered by thick, real chocolate. To produce a good-quality coating, Unilever asked the Belgian chocolate maker Callebaut to develop a chocolate that could be taken down to -40 degrees without breaking, something that did not exist before.

The Magnum stood out from the crowd because it was simple, yet sophisticated. According to Unilever, by 1992 it was already “Europe’s most popular chocolate ice cream bar”.

Simplicity, though, did not last long with the Magnum. With time, the Magnum evolved from the original ice cream into an ecosystem of elaborate ice creams: Almond, Mint, Caramel and Nuts, Yogurt; bigger and smaller Magnums; even Magnums without sticks. The original Magnum, in this new fauna of Magnum ice creams, was renamed Magnum Classic.

Magnum’s Syndrome?

When  this process of differentiation started, and the promise of simplicity  was broken, I got upset. When Unilever came out with Moments, small ice  creams stuffed with caramel and hazelnut, I decided the company had  reached the limit, and prophesied Magnum’s fall from greatness to dust.

In conversations, whenever dealing with something unnecessarily complex, I would refer to what I called the Magnum Syndrome: “Things start nice and simple, but with time they accumulate complexity. This is when they lose their strength, as in the Magnum’s case: it is not the delicacy it used to be, there’s too much noise around.”

I could find plenty of things that had fallen victim to the Magnum Syndrome: the world’s economy, science, politics, societies in general, Europe in particular. When the Cherry Guevara was launched, together with other terrible Magnum flavours like the John Lemon, the Wood Choc, and the Jami Hendrix, I considered them the four Horsemen of the Apocalypse. The Magnum ecosystem will collapse soon, I thought while biting into my Classic. And global capitalism will surely follow.

Taming Complexity

I might have thought that science had become too complex, but I was still convinced that physics is an extremely successful tool with which to tame complexity.

Examples of physics successfully taming complexity abound. Take statistical mechanics. During the 19th century, physicists studied the statistical properties of the motion of molecules in a gas and discovered that, despite their seeming randomness, properties like temperature, pressure, and even the obscure concept of entropy were all explainable in terms of probability: the behaviour of billions of molecules could be described by just a few variables linked to each other.

Key to this success was the creation of ideal systems, like the ideal gas. But how can one possibly find an ideal system with which to describe the behaviour of the stock market, human societies, or the marketing strategy of Unilever?

The Network Revolution

With perfect timing, a new branch of physics was officially born together with the fauna of Magnum ice creams: network theory.

Network theory was the illegitimate child of the World Wide Web. With the Web, it finally became possible to obtain data with which to study how networks evolve. Physicists and mathematicians threw themselves into data analysis and modelling, producing new results on social topics too: networks of people exchanging email messages, websites referring to each other, blog feeds, all generated an abundance of digital data. Results were so original that stern journals like the Physical Review began to publish articles on social networks – a social topic, for the first time ever.

With network theory, physicists were entering the arena and facing the complexity of the “real world” – just like biologists, economists, sociologists, and anthropologists had been doing for a while.

The power of networks is that everything can be reduced to a network and studied, even the Magnum ecosystem.  As soon as we can connect two ice creams because they have in common a  particular ingredient, like caramel or dark chocolate, or are part of  the same offer, like the “Seven Deadly Sins”, we have a network.

For instance, the first Magnum Classic leads to the first four Magnum variations (Double Caramel, Dark, Double Chocolate, and Almond) that followed it a few years later, while the Double Caramel leads to Taste (in the Five Senses) and Sloth (in the Seven Sins), similar ice creams that were launched subsequently.
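
As an illustration (not part of the original analysis), here is a minimal sketch of how these links could be encoded, assuming the networkx Python library and using PageRank as a stand-in for the “influence” mentioned in the figures below:

    # A minimal sketch (assumption: the networkx library), encoding the links
    # described above and using PageRank as a stand-in for "influence".
    import networkx as nx

    G = nx.DiGraph()

    # The Classic leads to the first four variations...
    for flavour in ["Double Caramel", "Dark", "Double Chocolate", "Almond"]:
        G.add_edge("Classic", flavour)

    # ...and the Double Caramel leads to two later, similar ice creams.
    G.add_edge("Double Caramel", "Taste (Five Senses)")
    G.add_edge("Double Caramel", "Sloth (Seven Sins)")

    # Influence of each ice cream: PageRank on the reversed graph, so that an
    # ice cream inherits importance from the variations it generated.
    influence = nx.pagerank(G.reverse())
    for node, score in sorted(influence.items(), key=lambda kv: -kv[1]):
        print(f"{node:25s} {score:.3f}")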

In this way, we can draw a graphical representation of the increase in complexity of the Magnum system over time: from the simple “star” at the beginning of the 1990s, with one central ice cream and four peripheral ones, to the intricate network arrived at after 2000 (in the figures, the size of each circle is proportional to that ice cream’s influence).

If we traced the evolution of the Web over the same period, we would get similar figures. We would see complexity emerge from the first 50+ webpages published by Berners-Lee in 1990 to the billion pages online in 2000.

The Magnum Strategy: Complexity is Good

In complex, organised networks, “the whole is more than the sum of its parts”, writes Herbert Simon in “The Architecture of Complexity” (1962). This “more”, this emergent property of the system – the network – is what makes different elements come together in a network and cooperate.

Simon was right. I, on the other hand, had been completely missing the big picture when criticising complexity.

What if we thought of the Magnum ice creams as an organism? Being sold under the same brand, Magnums form a collaborating community: each Magnum tells us something about the other Magnums – something good – with all Magnums starting out with the most excellent of reputations, based on the original Classic’s. A customer will expect, and find, good-quality ingredients in any Magnum, because she knows that the original Magnum’s strength was good-quality vanilla and thick Belgian chocolate. In this sense the Classic has a link with all the other Magnums and collaborates with them in a virtuous circle: the Classic’s reputation gets stronger as it recommends other high-quality ice creams, which in turn, being actually decent, recommend the original Classic. The same circle can exist for Magnums other than the Classic: under the area of influence of the Caramel are now the “Caramel and Nuts” and “Sloth” Magnums, which Unilever introduced after its success.

With its fast-growing reputation, Unilever kept introducing new Magnums at a fast pace, making the Magnum empire more complex, but also more powerful. Thanks to this strategy – complexity with high quality and strong connections – in 2000 the Magnum became the largest single ice cream brand in Europe.

This success would not have been possible if the twenty-plus Magnums had been sold by different companies, under different brands. We would see a situation similar to the one before the Magnum arrived: many over-complicated ice creams, among which it is difficult to make a choice. The stronger the connections between the elements of a system, the bigger the possible success. No connections between the elements, no success.

Simon shows that Magnum’s evolution towards complexity was not just a potential syndrome, but a powerful strategy –the Magnum Strategy:

Start simple. Learn from the environment. Grow complex, maximising internal collaboration.


A few references:

  • Maljers, F. (1992). “Inside Unilever: The Evolving Transnational Company”. Harvard Business Review, September 1992.
  • Berners-Lee, T., & Fischetti, M. (2000). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor (foreword by M. L. Dertouzos). HarperInformation.
  • Dorogovtsev, S. N., & Mendes, J. F. F. (2003). Evolution of Networks. Oxford University Press.
  • Clarke, C. (2012). The Science of Ice Cream. Royal Society of Chemistry.


Can a lion jump 36 feet?

Using data from Rory Young’s answer: yes, a lion definitely can jump 10+ meters.

The horizontal distance D of a projectile with speed V, launched at an angle \theta is:

 D = V^2 \cdot sin(2\theta) / g

Which gives, for a lion running at 54 km/h, i.e. V = 15 m/s, jumping at an angle of 45 degrees, and assuming g \approx 10 m/s^2:

D = 22.5 m

Considering that running at maximum speed and jumping at 45 degrees is very hard, even for a lion, this is an over-estimation. But jumping at 15 degrees would allow the lion to fly over a distance D:

 D \approx 11 m = 36ft
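
A quick numerical check of both cases (same formula as above, taking g ≈ 10 m/s²), as a small Python sketch:

    # Horizontal range D = V^2 * sin(2*theta) / g, as in the formula above.
    import math

    def jump_range(v_ms, angle_deg, g=10.0):
        """Distance covered by a projectile launched at speed v_ms (m/s) and angle angle_deg."""
        return v_ms**2 * math.sin(math.radians(2 * angle_deg)) / g

    print(jump_range(15, 45))   # 22.5 m: top speed, optimal 45-degree jump
    print(jump_range(15, 15))   # ~11.25 m, about 36 ft: a flatter 15-degree jump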

In addition, if a dog can jump 29 feet (9m), clearly a lion can do better….


How can entropy both decrease predictability and promote even distributions?

Answer by Mario Alemi:

Consider this image:

There is no uniformity, and the level of prediction is high. You immediately see that the person on the left is rich, the one on the right is poor. You know who'll have a good meal tonight. Who has higher life expectancy. And so on.

Consider this now:

High uniformity, low prediction. Prison uniforms, well, they uniformize. From the picture, you can't get much information, apart from the fact that they are in prison.

Therefore: the first picture carries a lot of information about the people. The second doesn't. High entropy for the first, low for the second.

An unequal distribution of income produces the first picture. A uniform distribution produces the second one.

Interestingly, this is what happens to societies, organizations, and organisms. They start with low entropy (uniformity, equality) and develop towards higher complexity (i.e. specialization, inequality).

(You can have a look at one of the many ways entropy can be defined on a graph, and see that for networks it's the same: Mario Alemi's answer to Entropy (information theory): What is an entropy of Graph? Is it related to concept of entropy in Information Theory?)


What is an entropy of Graph? Is it related to concept of entropy in Information Theory?

Assuming you have an undirected, unweighted graph, it is the entropy of the frequency distribution of the nodes’ connections.

(NB: you can follow the calculations below with this Google spreadsheet: Entropy of Graphs; I use node for vertex and connection for edge.)

Said in a simple way, entropy is the average “surprise” when you look at the nodes’ number of connections.

If we define surprise as -log(p), you have high surprise when you observe something very improbable (p~0) and low surprise when you observe something very probable (p~1).

The average surprise will be the sum over all nodes i of

-p_i * log(p_i)

or:

average(-log(p_i))

where p_i is the number of connections of the i-th node divided by the total number of connections.

From the definition above, it follows that (Shannon) entropy measures uncertainty: the more you are surprised on average, the less you know about the state of the nodes.

Take column D of the Entropy of Graphs spreadsheet. This is an Erdős–Rényi random graph (see Wikipedia for more): the number of connections per node follows a Poisson distribution. The entropy of the graph is the entropy of the Poisson distribution, which is relatively high: looking at this graph, all nodes seem to have roughly the same number of connections, and you don’t know how to tell them apart!
If we take a small-world graph generated by the Barabási–Albert model (column B), the number of connections follows a Zipf (Pareto) distribution. You have many nodes with few connections and a few nodes with very many; the hubs are individually very surprising, but the average surprise, i.e. the entropy, is lower. This means low uncertainty: as in society, you can easily pick out the rich from the mass of ordinary people…

From this, and from the preferential-attachment growth described by Barabási and Albert, it follows that even when networks start with a uniform distribution of connections (like a small society), they evolve, as new nodes are added, towards states of increasingly lower entropy (uncertainty). Successful graphs, the small-world graphs, have lower entropy (uncertainty) than unsuccessful ones.
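
As a minimal sketch of this comparison (it assumes the networkx Python library, which is not mentioned in the answer), one can generate the two kinds of graph and compute the entropy of their connection distributions:

    # Entropy of the connection distribution, with p_i = connections of node i
    # divided by the total number of connections, as defined above.
    import math
    import networkx as nx

    def degree_entropy(G):
        degrees = [d for _, d in G.degree()]
        total = sum(degrees)
        return -sum((d / total) * math.log(d / total) for d in degrees if d > 0)

    er = nx.erdos_renyi_graph(1000, 0.01, seed=1)   # Poisson-like connection distribution
    ba = nx.barabasi_albert_graph(1000, 5, seed=1)  # Zipf/Pareto-like connection distribution

    print("Erdos-Renyi entropy:    ", degree_entropy(er))
    print("Barabasi-Albert entropy:", degree_entropy(ba))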


Why is the future and the past so different?

The “directionality” of time can easily be explained in terms of entropy, as you point out – and not vice versa. It’s also easy to understand, if you consider the statistical-mechanics definition of entropy.

Slightly formal: systems evolve from the past, where they find themselves in an improbable state, to the future, where they find themselves in a more probable state than the previous one. If you see steam flowing back into a manhole in NYC, you know you are watching a reversed movie.

Less formal: imagine you have a bookshop, with all the books ordered alphabetically. Take a picture in the morning. Then customers arrive, leave a few books lying around, put them back in the wrong order, and so on. At the end of the day you take one more picture. If no one is tidying the bookshop, you can immediately tell which picture was taken in the past and which in the future. Boltzmann (the physicist who formalised the statistical definition of entropy) would say that the entropy in the morning was lower than the entropy in the evening.
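
To make Boltzmann’s point concrete, here is a small back-of-the-envelope sketch (a hypothetical example, not in the original answer): count the possible arrangements of the books.

    # Out of all possible arrangements of N distinct books, exactly one is alphabetical:
    # the "tidy" macrostate corresponds to far fewer microstates than the "messy" one,
    # which is why the shop drifts from order to disorder and not the other way round.
    import math

    for n_books in (5, 10, 20):
        arrangements = math.factorial(n_books)
        print(f"{n_books} books: {arrangements} arrangements, "
              f"chance of the alphabetical one at random = {1 / arrangements:.1e}")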

The entropy of a system in a certain state is a measure of the probability of that system being in that state. From that, it follows that a system will naturally evolve from a state of low entropy to a state of high entropy, because (again, this is how entropy was defined by Boltzmann) the latter is more probable. Clearly, if you put energy into the system, you can go towards a state of lower entropy, but if you took a bigger picture, you would see other systems going towards states of even higher entropy (your body while tidying up the bookshop, for instance).

In this sense we say that time is a dimension with a direction. It is different from space, and this is why, to be technical, in space-time the time coordinate enters the metric differently from the space coordinates.

There are two interesting points following from this. One is that you can imagine a universe where time is like space, non-directional. This is what Augustine of Hippo said some 1,600 years ago to answer the question: why did God create the universe, considering that he is perfect and should not need a universe to feel even more perfect? Answer: there was no time before the universe was created, because it is the universe itself that defines time (the point that Boltzmann’s entropy makes precise). A similar argument is in “The Large Scale Structure of Space-Time” by Hawking and Ellis, when they say that there is a singularity in the past which “constitutes, in some sense, a beginning to the universe”.

PS Note that things like a person jumping from the ground floor to the 10th floor of a building are not time-defining. It appears to go against the past-to-future flow, but only because we know it is difficult for someone to have that strength. If the person is Wonder Woman, we don’t notice the movie is actually reversed, as the trajectory makes perfect sense.


Mobile apps are a temporary step

(from Quora — What does Marc Andreessen mean when he says “Mobile apps on platforms like iOS and Android are a temporary step along the way toward the full mobile web”?)

Let’s have a look at the past, and see if we can understand the present and, within limits, forecast the future. This is, I believe, the only way to understand Andreessen.

Language — creation and break-up

Apps today are distributed through many incompatible application stores. This reminds me of the Tower of Babel: great enthusiasm with the invention of language, but then people started speaking different languages. If you look at any information revolution, not just the creation of language, the pattern has always been the same. Humanity creates a tool for exchanging information, and then breaks this tool into many tools, similar yet incompatible with each other. It is neither good nor bad – it is just inevitable, and for good reasons (a language needs to become more specific within each community), which we shouldn’t analyse here. The question is: does it sound familiar? Well, yes!

How Apple saved the publishers

Apple, and all the other corporations who followed it, gave new life to the web. Five years ago, newspaper, movie and music publishers were lamenting: “How will we survive if no one wants to pay for reading an article?”

A few years later, with the introduction of the App Store, everyone was happy. Users can access information from their mobile, and publishers can easily charge for it. Soon, similar but incompatible application stores appeared. Apple, rightfully, wants to defend its dominant position, and refuses to have other application stores on the iPhone. Apple would like (it seems) to sell an iPhone to every single human being, so we can all buy on the App Store. This might be a bit too much, even for Apple.

Why applications have to change

If I were an entrepreneur or a developer, I wouldn’t want Apple to impose its technology on me, even if it is a good technology. This is why I am waiting for html5 to be good enough to be used for serious stuff. Once I have my app developed in html5, I can sell it on any possible app store – be it Google Play, the Amazon Store or “the” App Store. Maybe the free version of my app will be available on the web.

In addition to that, as a user I don’t want Apple to decide which apps I can have and which I cannot. Government censorship is more than enough. This is more important than it may seem to Apple: without free exchange of information, humanity will go backward, not forward.

Apple and Manuzio, evolution of the revolution

And we will go forward. We are living in the middle of the biggest information revolution since the movable-type one, when the modern press was invented. The revolution we are living through will still deliver high-quality, free information to humanity, exactly as books and newspapers did from Gutenberg on. If Netscape was Gutenberg, Apple is the new Aldus Manuzio, the Renaissance publisher who set up a definite scheme of book design and introduced small, pocket editions (mobile editions…) of books. Manuzio became rich, but many followed in his steps, as many have followed in Apple’s.

This revolution will also be remunerative! Following the Apple business model, some young publishers are already able to leverage the new tools for information exchange (books, movies, magazines, music etc.). Since publishers have to change their vision, many old publishers, AKA the dinosaurs, will die.

Last, we need someone to point us to the right app or website when we are looking for it, exactly as Google did 15 years ago. At the moment, apps are more like monads that hardly interact with each other. And in the absence of a network, it’s hard to tell who the leader is… but we are already seeing huge efforts in this direction, and this will soon happen.

Like saying…

In conclusion, I think we can read Andreessen as saying: “The web is alive, and it is doing pretty well. Apps are good, but they are closed. This means they will soon change, and we will have application-like websites, able to pay salaries to journalists and musicians, but also accessible to everyone with a pocket-size computer” – as Manuzio might have called smartphones.

PS The Tower of Babel, from Genesis:

And the whole earth [Internet] was of one language [http]. And they [the users] said: “Go to, let us build us a city and a tower, whose top may reach unto heaven; and … lest we be scattered abroad upon the face of the whole earth. [Lest we call this the "World Wide Web"]

And the Lord [Corporations] said: “Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do.” [or: we thought the web was a huge supermarket, but people started using it for exchanging movies and reading news and books for free. How will we survive? And honestly, shouldn't people pay when using the product of someone else's work?]

“Go to, let us go down, and there confound their language, that they may not understand one another’s speech.”

[And the application stores were created.]

 


Using Google REST API for Analytics

If you are not a Google engineer, OAuth2 for Google APIs can be a nightmare without good examples. If you think Facebook was bad at providing instructions, wait until you get lost in the Microsoft-style ocean of useless information provided by Google on its new V3 gdata…

I assume you have curl installed, and here is a step-by-step guide on how to retrieve Google Analytics data using curl, which for me remains the best way to test any API.

  1. Visit Core Reporting API, and get confused by the amount of (not very useful) information ;
  2. On your browser, log-in at Google with the account you want to use to access Analytics info;
  3. Go to the APIs console. Click on create a “project”. Put “status on” for Analytics. Possibly rename the project by going to the left menu “API Project” (the name given by default to the project);
  4. Click on “API Access”. Click on “Create an OAuth 2.0 client ID”. Give it any name (this is the name users would see, but here the only user will be you, downloading your own data). Choose “Installed Application” and then “Create ID”. This information will appear:
    Client ID for installed applications
    Client ID: 1234567890.apps.googleusercontent.com
    Client secret: xywzxywzxywzxywzxywz
    Redirect URIs: urn:ietf:wg:oauth:2.0:oob
    http://localhost

  5. Go to Using OAuth 2.0 for Installed Apps (hardly useful), form the following URL and visit it with the browser where you logged in with your Analytics account (I had to guess the scope, could not find a page where all scope values are listed!):
    https://accounts.google.com/o/oauth2/auth?
    scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fanalytics&
    redirect_uri=urn:ietf:wg:oauth:2.0:oob&
    response_type=code&
    client_id=1234567890.apps.googleusercontent.com
  6. Allow the access, of course, and copy the authorization code that appears (you will need it in the next step);
  7. Go to the terminal and run the command below, using your own “code” from above, “client_id”, etc.:
    curl https://accounts.google.com/o/oauth2/token \
      -H "Content-Type: application/x-www-form-urlencoded" \
      -d code=4/v6xr77ewYqjkslsdUOKwAzu \
      -d client_id=1234567890.apps.googleusercontent.com \
      -d client_secret=xywzxywzxywzxywzxywz \
      -d redirect_uri=urn:ietf:wg:oauth:2.0:oob \
      -d grant_type=authorization_code
  8. You’ll get a JSON like this one:
    {
      "access_token" : "ya29.AHES6Zkjhkjhahskjhskkskjh",
      "token_type" : "Bearer",
      "expires_in" : 3600,
      "refresh_token" : "1/HH9E7k5D0jakjhsd7askdjh7899a8sd989"
    }
  9. If you curl:
    curl 'https://www.googleapis.com/oauth2/v1/tokeninfo?access_token=ya29.AHES6Zkjhkjhahskjhskkskjh'

    you’ll get something like:

    {
     "issued_to": "562211803675.apps.googleusercontent.com",
     "audience": "562211803675.apps.googleusercontent.com",
     "scope": "https://www.googleapis.com/auth/analytics",
     "expires_in": 3556
    }

    (see below how to renew the token without having to ask for another “code” as in point 5)

  10. Done:
    curl 'https://www.googleapis.com/analytics/v3/management/accounts?access_token=ya29.AHES6Zkjhkjhahskjhskkskjh'

    will give you all info about your accounts, more info on the Management API REST page.

  11. How to get the data? A nice hint I could not find anywhere is that the URL of the new Analytics Dashboard provides account, web property and profile:
    https://google.com/analytics/web/#dashboard/default/aACCOUNTwWEBPROPERTYpPROFILE/
  12. Which means you can get the data with the following address:
    curl 'https://www.googleapis.com/analytics/v3/data/ga?ids=ga:PROFILE&metrics=ga:visits&start-date=2011-12-01&end-date=2011-12-08&access_token=ya29.AHES6Zkjhkjhahskjhskkskjh'

Renew the token

You have to use the “refresh_token” received in point 8:

curl https://accounts.google.com/o/oauth2/token \
  -d client_id=562211803675.apps.googleusercontent.com \
  -d client_secret=ZQxoOBGbvMGnZOYUrVIDXrgl \
  -d refresh_token=1/HH9E7k5D0jakjhsd7askdjh7899a8sd989 \
  -d grant_type=refresh_token

and you’ll get a new access_token.
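
For reference, the same two calls (the token refresh and the data query) can also be scripted; below is a minimal sketch using Python's requests library, with the endpoints and parameters used above and the placeholder credentials from this post:

    # Same flow as the curl commands above: refresh the access token, then query
    # the Core Reporting API. Client ID, secret, refresh token and profile ID are
    # placeholders taken from the examples in this post.
    import requests

    CLIENT_ID = "1234567890.apps.googleusercontent.com"
    CLIENT_SECRET = "xywzxywzxywzxywzxywz"
    REFRESH_TOKEN = "1/HH9E7k5D0jakjhsd7askdjh7899a8sd989"
    PROFILE_ID = "PROFILE"

    # Step 1: exchange the refresh token for a new access token (as in "Renew the token").
    token = requests.post(
        "https://accounts.google.com/o/oauth2/token",
        data={
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "refresh_token": REFRESH_TOKEN,
            "grant_type": "refresh_token",
        },
    ).json()["access_token"]

    # Step 2: request the visits metric for the same date range as in point 12.
    report = requests.get(
        "https://www.googleapis.com/analytics/v3/data/ga",
        params={
            "ids": "ga:" + PROFILE_ID,
            "metrics": "ga:visits",
            "start-date": "2011-12-01",
            "end-date": "2011-12-08",
            "access_token": token,
        },
    )
    print(report.json())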

PS To my surprise, this is still today the most successful post of the blog! Now, if you have time, you’d do me a great favour by having a look at this post http://www.visualab.org/index.php/how-is-complexity-science-applied-to-business-problems and letting me know your feedback. My literary agent was not really enthusiastic – would you be? Negative feedback is very welcome too :)


aNobii: virtual libraries and actual sales

This post was inspired by “aNobii Library and copies sold” (in Italian), by jumpinshark. I too enjoy scraping various sites, aNobii included before they hired me :), and I think that looking from the outside one often has ideas that would otherwise struggle to emerge from the inside. The idea that not all books have the same ratio of sales to presence in aNobii libraries is indeed interesting: some books are extremely successful on aNobii but don’t sell much, others the other way around. So I could not resist using some probability distributions to see if I could generalise the results obtained by jumpinshark. Note that I did not use any internal aNobii data for this.

Unique aNobii visitors in Italy

Before starting, however, I must underline that there was no drop in the number of unique visitors we had (contrary to what DoubleClick estimates, as reported by jumpinshark). Even in Italy, we never dropped below 30,000 unique visitors per day and we are still (very well!) above 70,000 per month. Above you can find the graph taken from Google Analytics, without numbers, which shows unique visitors per month in Italy during the past twelve months: it’s constant.

That said, let’s go back to jumpinshark. As mentioned, his analysis is correct, and may be further formalized, using distributions instead of averages.

The analysis –a little more formal

You can find all the results on this Google Spreadsheet. Let’s list the assumptions:

  1. The distribution of the number of books purchased per year is a Pareto-Zipf with exponent 2: if I have 100 people in Italy who have bought 1 book, I’ll have 100 / 2^2 who have bought 2, 100 / 3^2 who have bought 3, etc. See column B of the spreadsheet. (With this distribution, the strong readers, who read more than 10 books a year, number about 2 million.)
  2. It is n times more likely that a person with n books joins aNobii than a person with 1 book. See column C.
  3. Each book has its “resonant reader”. For example, I assume that the “resonant” readers for Umberto Eco buy 15 books a year (kind of intellectual), while the readers of Benedetta Parodi (a TV-show lady) read just 1 book per year. I say “resonant” because usually in these cases one uses the Cauchy distribution, which is used in physics precisely for resonances. Resonant does not mean exclusive: other readers might like Eco and Parodi too, only a bit less. See columns D & E and cells G9, H9.

End of assumptions. The distributions above are quite standard in these cases, and show that when an author resonates with avid readers, as Eco does, s/he has a better chance of appearing on aNobii than a more popular book with the same amount of sales. This stems from the fact that a voracious reader is more likely to end up on aNobii, even though there are many aNobii members with very few books in their library, just as in the hypothesized distribution.
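
Below is a minimal numerical sketch of these three assumptions; the constants (the Cauchy width, the join-probability scale, the 50-books cut-off) are illustrative choices, not data, so only the relative size of the two ratios is meaningful:

    # Toy version of the model: Pareto-Zipf readers (assumption 1), aNobii membership
    # proportional to the number of books bought (assumption 2), Cauchy-shaped
    # "resonance" around a reference reader (assumption 3).
    def cauchy_weight(n, resonant_n, gamma=2.0):
        """Relative chance that a reader of n books/year buys this author."""
        return 1.0 / (1.0 + ((n - resonant_n) / gamma) ** 2)

    def buyers_per_anobii_owner(resonant_n, max_books=50):
        readers = {n: 100_000 / n**2 for n in range(1, max_books + 1)}
        join_probability = {n: n / max_books for n in readers}
        buyers = sum(readers[n] * cauchy_weight(n, resonant_n) for n in readers)
        owners_on_anobii = sum(readers[n] * cauchy_weight(n, resonant_n) * join_probability[n]
                               for n in readers)
        return buyers / owners_on_anobii

    print("Eco-like author (resonant reader: 15 books/year):", round(buyers_per_anobii_owner(15)))
    print("Parodi-like author (resonant reader: 1 book/year):", round(buyers_per_anobii_owner(1)))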

So for every aNobii user who owns Eco, there are about 230 people on the street who have actually bought him. For Ms Parodi the ratio is 1 to 1,000 instead. Still, as noted by jumpinshark, there are other factors: I bet that “aNobii, the reading worm”, a book composed of aNobii reviews, has a ratio close to 1 :) Anyway, even if the numbers are far from perfect, the result is not bad, IMHO…



Bayesian Fantacalcio

I have been told that the key to fantacalcio (Italian fantasy football, which I have never played) is working out which players will actually take the field on Sunday. Either you guess at random, or you read the various sports papers, and then guess at random.

What is the probability that a player, say Pippo, will take the field, given that a certain number of papers list him as likely to play?

I don’t know, but it is a good example for seeing how Richard Cox’s “logical interpretation of probability” works. As Wikipedia puts it: “As the laws of probability derived by Cox’s theorem are applicable to any proposition, logical probability is a type of Bayesian probability.”

Let us introduce two axioms. Imagine that on Monday we think about next Sunday’s match, and that the Gazzetta publishes its probable line-ups on Thursday:

  1. If we can say how strongly we believe that Pippo will play next Sunday, we can also say how strongly we believe that Pippo will not play. Trivial.
  2. If we can say how strongly we believe that on Thursday the Gazzetta will say that Pippo plays, and we can say how strongly we believe that Pippo will play when the Gazzetta lists him, then we can also say how strongly we believe (on Monday) that on Sunday evening we will have had both Pippo on the pitch and Pippo in the Gazzetta’s probable line-up published beforehand. In other words: if I know the Gazzetta, I know Pippo.

With these two axioms we can write two formulas, which map our “how strongly we believe that…” onto positive real numbers. For example: “on a scale from 0 to 20, the probability that Pippo plays is, in my view, 18” means that I am fairly convinced that Pippo will play. In probability one normally uses a scale from 0 to 1 instead of 0 to 20, but that is pure convention. The formulas:

The probability that Pippo plays given certain information, plus the probability that Pippo does not play given the same information, equals 1; we always hold this to be true:

(1)    prob(plays | info) + prob(does not play | info) = 1

The probability that on Sunday evening we find both that Pippo has been on the pitch and that the Gazzetta had put him in its probable line-up equals the probability that Pippo plays when the Gazzetta lists him, times the probability that the Gazzetta lists him:

(2)     prob(plays, listed | info) =
prob(plays | listed, info) • prob(listed | info)

Where the comma “,” means “and at the same time”, the bar “|” means “given”, and the dot “•” means “multiplied by”.

The important thing to understand is that when we say “probability” we mean “how strongly I believe that”. A probability of 90% does not mean that, testing the model a hundred million times, the hypothesis comes true 90 million times. The probability that Pippo will play depends on various factors, such as his physical condition, the opposing team and the line-up the coach consequently chooses, and so on. On this basis, and this basis only, I should estimate the probability that Pippo will play. But since I am lazy, I let the Gazzetta do this work and use its estimate for my fantasy line-up.

Bayes’ theorem

The two formulas above are the basis on which to build an algebra of probabilities. The first formula to derive is Bayes’ theorem. Let us define:

G: Pippo plays
F: the Gazzetta puts him in its probable line-up
I: background information on Pippo’s physical condition, etc.

We can then write:

(3)     prob(G | F, I) = prob(F | G, I) • prob(G | I) / prob(F | I)

Formula (3) follows directly from (2), with F and G swapped:

(2′)     prob(F, G | I) = prob(F | G, I) • prob(G | I)

Since the probability that “F and G are true on Sunday evening” equals the probability that “G and F are true on Sunday evening”, i.e. prob(F, G | I) = prob(G, F | I), we can equate the right-hand sides of formulas (2) and (2′) and obtain (3).

Marginalisation

The second formula is the marginalisation formula. Consider that:

(4)   prob(G | I) = prob(G, F | I) + prob(G, not F | I)

Formula (4) tells us that the probability that Pippo plays equals the probability that Pippo plays and the Gazzetta lists him, plus the probability that Pippo plays and the Gazzetta does not list him. Obvious.

Now, imagine that instead of looking only at the Gazzetta we wisely look at 5 sports papers. Each of these papers will either put Pippo in its probable line-up or not: we have 32 possible combinations, from 0, 0, 0, 0, 0 (no paper lists Pippo) to 1, 1, 1, 1, 1 (all papers list Pippo). These 32 combinations, which we call F1, F2, F3, …, F32, are mutually exclusive, i.e. only one of them can be true. Let us also add that at least one must be true: these are all the possible combinations. So we can write:

(5)    prob(G | I) =
prob(G, F1 | I) + prob(G, F2 | I) + … + prob(G, F32 | I)

Since we said that F1…F32 are mutually exclusive, (5) is as obvious as (4): the probability that Pippo plays equals the probability that he plays and no paper has listed him, plus the probability that he plays and exactly one of the five papers has listed him, and so on.
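
To make formulas (3) and (4) concrete, here is a small sketch with invented numbers (the prior and the two likelihoods below are purely illustrative, not estimates):

    # Invented numbers, purely to illustrate formulas (3) and (4).
    prior_plays = 0.6          # prob(G | I): our Monday belief that Pippo plays
    listed_if_plays = 0.9      # prob(F | G, I): the Gazzetta lists him when he plays
    listed_if_not_plays = 0.2  # prob(F | not G, I): the Gazzetta lists him when he does not

    # Marginalisation, as in (4): prob(F | I), summing over "plays" and "does not play".
    prob_listed = listed_if_plays * prior_plays + listed_if_not_plays * (1 - prior_plays)

    # Bayes' theorem, formula (3): updated belief once the Gazzetta lists Pippo.
    posterior_plays = listed_if_plays * prior_plays / prob_listed

    print(round(posterior_plays, 2))   # about 0.87 with these numbers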


None of this helps, yet, to predict whether Pippo will play or not, but in my view it teaches something: by all means look at the papers’ probable line-ups, but add something of your own. Be the expert who can predict whether a player will play or not, regardless of what the papers say…
