Save money on split testing using Amazon’s Mechanical Turk
By now, most of us have heard that we should be testing. A/B testing. Split testing. Multivariate testing.
You should test your ads, test new features, test minimum viable products, test emails, and on and on.
And we totally agree.
But there are problems with it. One big problem is cost.
For example, it quickly gets very costly to split test Google ads if each click is $3.
Or what if you want to take an ad out on a premium ad network like Coudal's The Deck or Fusion Ads?
A Deck ad is $7,200.
For many small companies, that isn't a price you can sniff at, and you can't casually test a new ad in the second month just to see if it does better.
You'd like to figure out as cheaply as possible and as soon as possible how to put your best foot forward.
I'm not saying we have a solution at all to this.
But we did something we find interesting that maybe we or someone else will find useful down the road.
How successful is testing things like ads, headlines, concepts, etc. with Amazon's Mechanical Turk?
Especially since, from what I've read, getting someone to do a simple task on the Mechanical Turk is cheap: $0.01 a task, for example.
We set up an experiment.
Can the folks on the Turk pick the best converting headline if given a selection of headlines?
(Primarily, though, this whole experiment was a way for us to learn more about the Turk. So to preempt any "you should have done it this way" or "why are you using the Turk + Wufoo for this": we'd love to hear any input, but this was first and foremost an excuse to understand how to use the Turk to do anything. Anything else is just "interesting", probably pre-useful, and we wanted to share it.)
I put a selection of headlines I already knew the results for into a Wufoo form. They came from a test 37signals ran for their headline.
(This test isn't approved, endorsed or affiliated with 37signals in any way.)
For background about their test: they saw statistically significant results showing that 2 of the headlines were better than the rest. Headline A and Headline B converted 30% and 27% better, respectively, than the original headline. As confirmed by Jason Fried, the improvements from Headline A and Headline B are statistically the same at this point. But they liked one of them better than the other, so they declared Headline A the winner. Statistically, however, Headline B could have been chosen as the winner as well.
We are in the business of crowd wisdom, so I was already hopeful this experiment would show some wisdom from workers on the Turk. But given an incentive of, say, $0.01 to just complete the survey no matter what, I figured everyone would click on something randomly to complete the task without thinking, and no single headline would provide a strong signal.
If that didn't happen, I figured some of the respondents might have heard of 37signals or already know the outcome of their test, and they'd pick the headline 37signals declared the winner.
Here are the results, which didn't agree with what I thought I'd see.
----
----
Headline B (named in the 37signals background above) was the winner (38.89% ± 7.1%).
The next best headline was chosen by 16.67% of respondents, with a confidence interval of ± 5.4%.
So workers on the Turk picked Headline B as the better performer by a statistically significant margin. And in 37signals' real-world test, that headline did deliver great results.
Surprisingly, though, Headline A, which performed very well in real life, landed in the middle of the pack.
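For anyone who wants to check margins like the ones quoted above, here is a minimal sketch of a 95% normal-approximation (Wald) interval for a survey share. The post doesn't state how many workers responded, so the 180 responses and the 70 and 30 vote counts below are assumptions, picked only because they reproduce the reported percentages. The helper name `wald_margin` is made up for this illustration.

```python
import math

def wald_margin(votes, n, z=1.96):
    """Return (share, margin) for a normal-approximation confidence interval.

    z=1.96 corresponds to a 95% confidence level.
    """
    share = votes / n
    margin = z * math.sqrt(share * (1 - share) / n)
    return share, margin

# ASSUMPTION: the sample size and vote counts are not given in the post.
# 180 responses with 70 and 30 votes match the reported 38.89% and 16.67%.
N_RESPONSES = 180
winner_share, winner_margin = wald_margin(70, N_RESPONSES)
runner_share, runner_margin = wald_margin(30, N_RESPONSES)

print(f"Winner:    {winner_share:.2%} +- {winner_margin:.1%}")
print(f"Runner-up: {runner_share:.2%} +- {runner_margin:.1%}")

# If the winner's lower bound sits above the runner-up's upper bound,
# the two intervals do not overlap.
intervals_overlap = (winner_share - winner_margin) <= (runner_share + runner_margin)
print("Intervals overlap:", intervals_overlap)
```

Under those assumed counts the two intervals don't overlap, which fits the reading that Headline B's lead wasn't just noise. Non-overlapping intervals are a rough check, not a formal significance test.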
More experiments need to be run to prove any kind of usefulness here. One reason for the behavior seen could be that Wufoo's randomization of the survey choices is biased. In my opinion, this isn't likely, but we plan on doing more tests to see how they play out.
I think, though, that this provides some encouragement that the Turk could prove itself as a crowd that can be used and tuned to make good predictions when you need one.
There are definitely more experiments with the crowd at Amazon's Mechanical Turk coming soon...