Upwork and the Curse of WYSIATI

Lukas Kovarik, March 6, 2019

How an innocent-looking cognitive bias can ruin your machine learning project before it even starts, as demonstrated by a mystery-shopping experiment we conducted on Upwork.

Hiring a machine learning expert can be a daunting task, not only because of talent scarcity, but also because of the lack of knowledge founders and managers have about the peculiarities of machine learning. So it is no surprise that many turn to online platforms such as Upwork, especially since these platforms promise a large pool of experts and a transparent hiring process.

These promises certainly sound attractive, but I would argue that they are horribly implemented. Thus, when applied to complex domains such as machine learning (as opposed to, for example, banner design), they can even become dangerous. Let’s highlight the most critical problems:

  1. Market inefficiency — despite the transparency promise, Upwork is a grossly inefficient market. Prices, which are supposed to carry information, carry little more than disappointing randomness. In addition, social proof is unreliable and trust is generally low.
  2. Irrationality — the platform’s dynamics encourage irrational behaviour and amplify cognitive biases such as the halo effect (from the buyer’s perspective) and “what you see is all there is”, or WYSIATI (from the seller’s perspective).

Both problems, as examined in the experiment described below, can teach us valuable lessons about the commoditization of knowledge work, outsourcing, and hiring in general.


Experiment Design

We published two job postings related to machine learning and NLP, which I’ll refer to as Project Easy and Project Hard. We had already implemented both projects at Bohemian AI prior to the experiment.

  1. Project Easy — an information extraction task that required three different entity recognition techniques to be stitched together and applied to different parts of a plain-text input. The magic was that almost all necessary tools were available as open source, so besides minor customizations and scaffolding, the only necessary skill was googling (a minimal sketch of what this looks like follows this list).
    Reasonable time estimate: 2–5 man days.
  2. Project Hard — a non-trivial implementation of an academic paper for which no code had been published and a large portion of functionality was vaguely implied rather than explicitly mentioned in the paper. Multiple models with different algorithms needed to be trained and the system was required to work on a fairly large dataset — which not only meant a specific system design but also had a serious impact on server costs.
    Reasonable time estimate: 1.5–3 man months plus min. $1,000 extra infrastructure costs for data mining and training.
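
To make the “only necessary skill was googling” claim for Project Easy concrete, here is a minimal sketch of the kind of off-the-shelf entity recognition work involved. It uses spaCy purely as an example of an open-source tool; the section-splitting logic, model choice and sample text are entirely hypothetical, and the actual project, its components and its data are not disclosed here.

```python
# Hypothetical sketch: apply off-the-shelf NER to different parts of a
# plain-text input. spaCy is used only as an example of an open-source
# tool; assumes the en_core_web_sm model has been downloaded
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")  # general-purpose English NER model


def extract_entities(plain_text: str) -> dict:
    # Naive split of the input into two sections; the real splitting
    # rules depended on the (undisclosed) document structure.
    header, _, body = plain_text.partition("\n\n")

    results = {}
    for section_name, section_text in (("header", header), ("body", body)):
        doc = nlp(section_text)
        results[section_name] = [(ent.text, ent.label_) for ent in doc.ents]
    return results


if __name__ == "__main__":
    sample = "ACME Corp invoice\n\nIssued to John Smith in Prague on 3 March 2019."
    print(extract_entities(sample))
```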

Each job posting described the expected inputs and outputs of the system (no user interface was required, only data in and data out), provided a sample of the dataset (clearly marked “sample”) and defined the expected deliverables (e.g. Python code with a README file). For Project Hard, we obviously attached the paper as well.

At the end of each job posting we stuck the sentence “Please start your cover letter with the words ‘ice cream’ so we know you’ve read this and also that you like ice cream as much as we do.”, suspecting that not every applicant is a thorough reader.

We purposefully didn’t include any other information beyond the (somewhat brief) technical specification to see if candidates were able to ask the right questions. We did not promote the job posting in any way.

Note: we ran the experiment once, obtaining a sample that’s hardly significant from a scientific standpoint. We are fully aware of that. On the other hand, no client trying to hire on Upwork will look at substantially more data and in this respect the experiment draws a very realistic picture. At the very least, we present a useful anecdote that is fully in line with our long-term experience with how people at Upwork do business.

Candidates

In four days, we received 39 applications in total (20 for Project Easy and 19 for Project Hard) from every inhabited continent except Australia. Nearly 25% of applications came from agencies; the rest came from independent freelancers.

We replied to every candidate regardless of whether they passed the “ice cream test” — which, as we suspected, many candidates didn’t. But the scale surprised us nevertheless: almost 30% of candidates didn’t pass, which is notable especially for Project Easy, where the ice cream sentence accounted for a full one third of the job posting’s text! The results illustrate the pressure that forces candidates to quickly reply to every job posting in sight, or even to develop automated bots that reply to postings for them.

Question Time

We opened each conversation with a standardized response that encouraged the candidate to ask questions in order to be able to prepare a fixed-price proposal. Naturally, we had our own idea of what the right questions would have been. But since we were there to simulate a real-life situation where a priori knowledge is not available, we put the candidates entirely in control of their question time. This has — at least in hindsight — become the most enlightening part of the experiment.

Not only did the candidates fail to ask the questions we found relevant, they had problems asking any questions at all. Out of the 32 candidates who later submitted a proposal, 12 (or 38%) didn’t ask a single question before quoting a price, relying exclusively on information presented in the job posting. Among them were five candidates with a Top Rated badge and two with Upwork lifetime earnings over $100k. Interestingly, fewer questions were asked for Project Hard (1.1 questions on average before a proposal) than for Project Easy (1.8), even though Project Hard was clearly much more complex.

When candidates did engage in their question time, they rarely raised questions we expected. Many questions were concerned with technicalities such as how a JSON string should be formatted.

There were at least five crucial topics that simply had to be raised for a candidate to prepare an informed fixed-price proposal — at least so we thought…

Table 1: List of topics we expected candidates to raise.

In other words:

  • 97% of applicants committed to a price without knowing anything at all about the business context of the project — was it a proof of concept for a wannabe startup? Or a strategic project for a global bank, where compliance and security requirements will easily triple the scope?
  • 97% of applicants committed to a price without understanding whether they’re supposed to build a lightning-fast, real-time, mission-critical system or a nice-to-have background job that can run on a spare server over the weekend.
  • 97% of applicants committed to a price without clarifying who provides the servers for model training and who pays for them.
  • 72% of applicants committed to a price not knowing whether the system is supposed to understand only English or also Mandarin, Hindi and Arabic.
  • 91% of applicants committed to a price without any idea whether they would be dealing with one megabyte of data per week or 10 terabytes per hour.

Such blatantly irrational behaviour cannot be explained in purely economic terms, nor can it be explained by candidates’ inexperience. There has to be a psychological phenomenon that causes otherwise smart and experienced people to commit such dangerous crimes against logic.

Daniel Kahneman offers an explanation with the “what you see is all there is” bias, described in detail in his book Thinking, Fast and Slow. Kahneman has long studied two relatively separate “circuits” of human thinking — System 1, which is fast, effortless, inaccurate, full of biases and often triggered under stress, and System 2, which is slow, effortful, logical and requires attention.

According to Kahneman, “The measure of success for System 1 is the coherence of the story it manages to create. The amount and quality of the data on which the story is based are largely irrelevant. When information is scarce, System 1 operates as a machine for jumping to conclusions.” [1] Jumping to conclusions was exactly what our candidates did. We can easily see how WYSIATI affected their thinking:

  • Overconfidence — the fact that candidates were able to estimate the projects with so little information is by itself remarkable. We will discuss in detail how this affected the resulting time and cost estimates.
  • Framing effects — we framed our job postings as technical problems and that’s exactly how the candidates treated them, ignoring the entire business context; misunderstanding that context leads not only to wrong solutions but also to catastrophic scoping errors.
  • Base-rate neglect — English is Upwork’s lingua franca and our job postings were naturally written in English. But that doesn’t change the fact that there is a maximum of 1 billion English speakers (including non-native speakers) in the world [3], meaning that the remaining 6.6 billion people (or 87%) don’t speak English at all. Yet the mere existence of other languages didn’t occur to 9 out of 10 candidates.

System 1 takes over whenever we don’t pay attention, and a lack of attention was strongly suggested by our ice cream test. When we add stress to the mix (see Upwork’s omnipresent notices of how many other people have applied for the same project, how miraculously cheap they are and how many of them are already being interviewed), we can conclude that there are strong psychological reasons for our candidates to behave the way we observed.

Proposals

The real curse of WYSIATI lies in its damaging business consequences. Consider the following table of cost estimates we received.

Table 2: Cost estimates received.

Because each candidate inevitably priced the projects with their own assumptions in mind, we got a 390x difference (!) between the lowest and the highest cost estimate for Project Hard. None of the candidates ever quoted us server costs.

Looking at candidates’ hourly rates, the median was significantly higher for Project Hard. This may seem like a nice correlation (the project is indeed much harder, so it attracts better experts), but that would be a dangerous shortcut. The only conclusion we can draw from this fact at this point is that Project Hard attracted candidates who were more self-confident.


Table 3: Candidates’ hourly rates as stated on their Upwork profiles.

We wanted to see if cost estimates somehow correlated with candidates’ “monetary” metrics — hourly rate (which signals confidence) and lifetime earnings (which signal proven experience on the platform). One of our hypotheses was that even despite WYSIATI and the lack of questions, experienced candidates would agree with each other on the price more closely, because they would converge on a “typical” scenario based on their long experience, as opposed to beginners who have seen relatively few projects in their careers. We were wrong — the most experienced candidates actually disagreed the most!


Figure 1: Cost estimate vs. hourly rate.


Figure 2: Cost estimate vs. total earnings.


To reduce the potential biasing effect of rates on the cost estimates, we also examined time estimates, expressed in hours. Obtaining them wasn’t as straightforward as one would expect, as some candidates weren’t able (or willing) to commit to a certain number of hours despite having committed to a fixed price, with an hourly rate visibly stated on their profiles — another flash of irrationality, or some sort of psychological game we weren’t able to recognize. In any case, for these candidates we simply did the math ourselves.
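
The conversion itself is trivial; here is a sketch with made-up numbers, since individual quotes are not disclosed here:

```python
# Derive an implied time estimate from a fixed-price quote and the hourly
# rate stated on the candidate's profile (both numbers are made up).
fixed_price_usd = 1200.0   # candidate's fixed-price quote
hourly_rate_usd = 40.0     # rate shown on the candidate's profile

implied_hours = fixed_price_usd / hourly_rate_usd
print(f"Implied time estimate: {implied_hours:.0f} hours")  # 30 hours
```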

The data doesn’t look much different:


Figure 3: Time estimate vs. hourly rate.


Figure 4: Time estimate vs. total earnings.


Table 4: Time estimates.

Being practical, how would you decide whom to hire based on such data? How would you know who offers good value for money? And what is suspiciously cheap? Would you look at medians? What if the problem is too complex for the “wise crowd” to comprehend and only a handful of top experts can estimate it? What if, on the other hand, it’s a trivial problem and people are just trying to take advantage of you?

My primary feeling looking at this data would be (and, actually, has been): frustration. There’s literally nothing I would be able to do with such data as a manager — and I would genuinely love to talk to someone who could. But I can offer a different view.

Healing the Curse

If you’re still interested in hiring a machine learning expert on Upwork, you need to accept the market inefficiency and be ready to collect information yourself. You need to create an environment where candidates can perform their best and think logically. And above all: you need to invest time.

Here are four practical steps that will make your experience substantially smoother:

  1. Change the dynamic. Release the stress and enable focus. Make it clear that you’re looking for a fruitful debate and the best solution, not an ASAP proposal or the lowest price (if you are, I strongly recommend you take a day off and do the math; there’s a strong economic rationale for hiring the best, and inevitably more expensive, developers). If possible, hire the best candidates for a short (and paid) test drive.
  2. Make sure you know all the right questions upfront, because unfortunately, no one is going to ask them. If you’re not a technical person, get an advisor you trust — you absolutely need your personal devil’s advocate. Over-communicate everything to candidates and make sure they fully understand. Frame your posting as a business problem and provide context.
  3. Develop a simple rating framework to determine candidates’ quality yourself (a minimal sketch follows this list). There’s no use in looking at hourly rates, earnings, or badges, and there’s even less use in relying on your gut feeling. For scientific justification, read how Kahneman redesigned a recruitment process in the Israeli Defence Forces from “completely useless” to “moderately useful” [2]. I intentionally refer to the original book instead of a random blog post (there are many) because the full context is important.
  4. Go beyond Upwork, because soft skills and social proof are much better expressed elsewhere. Talk to the candidates on video. Ask for GitHub, LinkedIn and StackOverflow profiles.
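
To make point 3 concrete, here is a minimal sketch of such a rating framework, loosely inspired by Kahneman’s structured-scoring idea. The criteria, weights and 1–5 scale are purely illustrative assumptions on our part, not a prescription:

```python
# Hypothetical rating framework: score each candidate independently on a
# few predefined criteria (1-5), then combine with fixed weights.
# Criteria and weights are illustrative only; choose your own before you
# start interviewing, and don't change them mid-process.
CRITERIA_WEIGHTS = {
    "asked_about_business_context": 0.30,
    "asked_about_data_volume_and_languages": 0.25,
    "clarified_infrastructure_and_costs": 0.20,
    "relevant_past_work": 0.15,
    "communication_quality": 0.10,
}


def score_candidate(scores: dict) -> float:
    """scores maps each criterion to a 1-5 rating; returns a weighted total."""
    return sum(CRITERIA_WEIGHTS[criterion] * scores[criterion]
               for criterion in CRITERIA_WEIGHTS)


candidate_a = {
    "asked_about_business_context": 4,
    "asked_about_data_volume_and_languages": 5,
    "clarified_infrastructure_and_costs": 3,
    "relevant_past_work": 4,
    "communication_quality": 4,
}
print(f"Candidate A: {score_candidate(candidate_a):.2f} / 5")
```

The particular numbers don’t matter; what matters is that every candidate is scored on the same predefined criteria before any overall impression is allowed to form.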

Conclusion

If there were a single most important takeaway from this experiment, it would be: psychology matters. In the startup world, we tend to focus on technology and often forget that human psychology has the same non-linear effects on our business — and sometimes in the least expected corners of the value chain.

Special thanks to Eduardo Cerna for helping me collect and crunch the numbers, and to Carlos Dreyfus, Michaela Mrázková, Dominik Pavlov and Daniel Kovařík for their helpful feedback.

References

[1] Kahneman, Daniel. Thinking, Fast and Slow. Penguin, 2012, p. 85.

[2] Kahneman, Daniel. Thinking, Fast and Slow. Penguin, 2012, pp. 229–233.

[3] “English-Speaking World.” Wikipedia, https://en.wikipedia.org/wiki/English-speaking_world

