December 29, 2013

Yeezy 2 Analysis - Spotting Fakes With Data (Part 3 - The Line between real and fake)

Now that we know the Red Octobers are dropping on December 27th (scratch that: we still don’t know anything about anything), it’s time to continue our analysis of using data to spot fake Yeezys.  If you missed Parts 1 or 2, or need a refresher, here are links:

Part 1 Recap:  There are lots of fake Yeezy 2s on eBay –> We need to eliminate them from our data

Part 2 Recap:  There are three groups of Yeezys for sale:  Open Variants, Disguised Fakes, and Real –> We need to distinguish between them in the data . . .

. . . which brings us to Part 3 and the question:  Where is the line between real and fake Yeezys?

Specifically, when looking at the Yeezy 2 histogram, where do the fakes end and where does the distribution of real sneakers begin?  A quick way to visually estimate the cutoff is to place a normal distribution curve around the center of the real group and see where it stops fitting the data as it nears the fake group.  (We’re assuming that the price distribution for real Yeezys is normal.)
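The curve-fitting idea can be sketched in a few lines of Python. This is a rough illustration only, not the method used to draw the charts: the mean, standard deviation, number of real pairs and observed bucket counts are all hypothetical placeholders. It compares what a normal curve predicts for each bucket with what was observed; a bucket holding far more sneakers than the curve predicts is where fakes are probably piling up.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal(mu, sigma) distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def expected_count(lo, hi, mu, sigma, n_real):
    """Expected number of real pairs priced in (lo, hi] under the normal assumption."""
    return n_real * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))

# Hypothetical parameters, NOT taken from the actual eBay data.
MU, SIGMA, N_REAL = 1625.0, 350.0, 1000

# Hypothetical observed counts per $250 bucket, keyed by (lo, hi].
observed = {(500, 750): 90, (750, 1000): 140, (1000, 1250): 95, (1250, 1500): 210}

for (lo, hi), count in observed.items():
    fitted = expected_count(lo, hi, MU, SIGMA, N_REAL)
    excess = count - fitted  # a large positive excess hints at fakes in the bucket
    print(f"${lo + 1}-${hi}: observed {count}, curve {fitted:.0f}, excess {excess:.0f}")
```

The buckets where the observed count and the curve roughly agree are the ones we can treat as real; the buckets where they diverge are the ones worth a closer look.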

Yeezy 2 histogram v2 w curve

Here we can clearly see that the curve bisects the $750-$1,000 range, indicating that – at least statistically – some percentage of those sneakers are real, and some percentage are fake.  In order to better estimate which ones are which, we can refine our visual exercise and take a more granular view of the data by using buckets of $100 increments, instead of $250.

Yeezy 2 histogram 100 w curve

For reference, the way to interpret buckets in this view is that the $1,500 bucket, for example, represents all sneakers which sold for a price between $1,401 and $1,500.  The refitted curve appears to have moved slightly to the right, and is now centered around the $1,701-$1,800 bucket (whereas before it was centered around $1,501-$1,750).  Our first insight, then, is that the average price of a real Yeezy might be closer to $1,750 than $1,625.
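For anyone who wants to reproduce the bucketing, here is a minimal Python sketch of the labeling rule just described, where each bucket is named by its upper edge. The function name and the sample prices are made up for illustration, and it assumes prices are whole dollars.

```python
import math
from collections import Counter

def bucket_label(price, width=100):
    """Return the upper edge of the bucket a price falls in.

    With width=100, every price from $1,401 to $1,500 maps to 1500.
    """
    return math.ceil(price / width) * width

# Hypothetical sale prices, for illustration only.
prices = [1401, 1450, 1500, 1501, 1000, 999, 1725]

histogram = Counter(bucket_label(p) for p in prices)
for edge in sorted(histogram):
    print(f"${edge - 99:,}-${edge:,}: {histogram[edge]} pairs")
```

Passing `width=250` gives the coarser buckets used in the first histogram.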

With regard to identifying fakes, we can see that the first bucket bisected by the left tail of the curve is now the $1,001-$1,100 bucket – as opposed to the $751-$1,000 bucket, previously.  This suggests that the fake cutoff is actually higher than we originally thought, perhaps indicating where the “Disguised Fakes” are.  A full segmentation of the histogram now looks like this:

Yeezy 2 histogram with sections v2

Using this new information and new picture, we can set the minimum at any value in the $1,001-$1,100 bucket and feel comfortable:  $1,100 will eliminate all sneakers in this bucket as fake; $1,050 will eliminate approximately half of the bucket, which makes sense because the curve crosses the middle of the bucket; or $1,001 will include all of this bucket.

So we just make an assumption, choose a minimum – let’s say $1,050 to play it safe – and move on, right?  Not so fast.  Before we chisel it in stone, let’s explore the other issues.

WARNING:  THE NEXT FOUR PARAGRAPHS ARE FOR DATA NERDS ONLY (you know who you are).  Otherwise, feel free to skip to “End of Warning”

Adding to the complexity is the fact that while the curve bisects the $1,001-$1,100 bucket first, it also bisects four other buckets.  This implies that there are reals and fakes in multiple buckets – and that makes sense.  In reality, the fake Yeezys are likely quite spread out and pressing into our real distribution (i.e., “Disguised Fakes”).  A clever seller (who is not concerned about his reputation) might even price fakes higher so that they give the impression of being real.

One way to manage this would be to rank order our data points (from largest to smallest) in each bin that is bisected, and call “fakes” everything that falls beneath the median of each bisected bin.  Similarly, we would be accepting that everything above that point is “real”.  But here we would run into the issue of what it means to have “overlapping distributions”, and this illuminates the different types of errors:  false positives (calling a fake Yeezy real) and false negatives (calling a real Yeezy fake).
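The rank-and-split idea can be sketched as follows. This is only an illustration with made-up prices, and it makes one assumption the text leaves open: a sneaker priced exactly at the median is counted as real.

```python
import statistics

def split_bin_at_median(prices):
    """Rank one bisected bin from largest to smallest and split it at the median.

    Returns (reals, fakes). Assumption: prices equal to the median count as real.
    """
    ranked = sorted(prices, reverse=True)
    cutoff = statistics.median(ranked)
    reals = [p for p in ranked if p >= cutoff]
    fakes = [p for p in ranked if p < cutoff]
    return reals, fakes

# Hypothetical prices from the $1,001-$1,100 bucket.
reals, fakes = split_bin_at_median([1010, 1100, 1050, 1090])
print("called real:", reals)
print("called fake:", fakes)
```

Every real pair that lands in the lower half becomes a false negative, and every fake in the upper half becomes a false positive, which is exactly the overlap problem.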

Returning to our visual analysis, we can see that although the curve crosses five buckets, it crosses the first three very low (near the x-axis), indicating that the probability of a sneaker from those buckets being real is very low.  We can feel confident, then, setting the cutoff above these buckets.  A somewhat more statistical approach would be to decide on our tolerance for falsely believing a shoe is real, and then to estimate where the fake shoes sit on the price distribution using a method similar to the one we used for the real distribution.  We could then estimate the probability that a shoe is real vs. fake by comparing the proportion of shoes likely to belong to each distribution.  If a bin sits where the two distributions overlap completely, we could define the probability of getting a real shoe as 50/50 (i.e., a toss-up).  If this is starting to sound complicated, that’s because it is – so we’ll just agree to eliminate the first three buckets.

. . . but if you’re curious about the issue, a similar problem is trying to guess whether a person is male or female when the only thing you are told is their height.  These are highly overlapping distributions, where being 5’8″ might be a toss-up.  By the look of our Yeezy histogram, we are in a much better place than height-based gender identification, but the problem is the same.  We have overlapping distributions.  Our confidence in the purity of either distribution goes up at the extremes, and goes down in the middle.  We could quantify that in a pseudo-statistical way by using the curves to give us a probability score.  There are quite sophisticated ways of getting that probability score right, but a quick way could be to generate weighted votes for fake / real on each score in a particular bin, and then generate a probability score by adding up the p(Real) / (p(Real) + p(Fake)).  If I hadn’t lost you before, I’m sure I have by now.  But that’s the point.  There are many complicated statistical ways we can test this, but for our purposes, we simply need a reasonable cutoff number – a minimum for choosing which Yeezys to include in our data analysis, and which to exclude.
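For the truly curious, here is one way to read that probability score, sketched in Python. It assumes both the fake and the real prices follow normal curves, and every number in it (means, standard deviations, and each group's share of listings) is a hypothetical placeholder rather than something measured from our data.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of a normal(mu, sigma) curve at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical (mean, standard deviation, share of listings) for each group.
FAKE = (600.0, 250.0, 0.4)
REAL = (1750.0, 350.0, 0.6)

def prob_real(price, real=REAL, fake=FAKE):
    """Score a price as p(Real) / (p(Real) + p(Fake)), weighting each curve by its share."""
    p_real = real[2] * normal_pdf(price, real[0], real[1])
    p_fake = fake[2] * normal_pdf(price, fake[0], fake[1])
    return p_real / (p_real + p_fake)

for price in (500, 900, 1100, 1300, 2000):
    print(f"${price:,}: probability real = {prob_real(price):.2f}")
```

The score climbs toward 1 at the high end and falls toward 0 at the low end, with the toss-up zone sitting wherever the two weighted curves cross.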

END OF WARNING:  BACK TO CONTENT SUITABLE FOR ALL

As explained above, there are many complicated statistical ways we can test this, but for our purposes, we simply need a reasonable cutoff number – a minimum for choosing which Yeezys to include in our data analysis, and which to exclude.  Using $1,001, $1,050 or $1,100 would all be reasonable.  But the curve crosses the $901-$1,000 bucket about 1/3 of the way up, and this bucket has a relatively large number of sneakers in it, so perhaps the minimum should fall within this bucket.  (Note:  Logically it makes sense that this bucket has many sneakers in it because sellers will often price their Yeezy at exactly $1,000.  In fact, 65 of the 118 in this bucket were exactly $1,000.)  We can feel comfortable setting the cutoff at $1,000 knowing that we will be accepting all sneakers in the $1,100 bucket as real, as well as all Yeezys priced at exactly $1,000.  Visually this feels right because of where the curve crosses each of these buckets – it looks like we’re accepting half of the “Fake or Real?” section.  And, in fact, a quick look at the numbers shows that with a minimum of $1,000 we would be including 152 pairs and excluding 142 pairs.  Makes sense.

The minimum cutoff is therefore $1,000 (inclusive).  Every Yeezy which sold for less than $1,000 will be eliminated from our data analysis.

The maximum cutoff is $3,250 (inclusive).  Every Yeezy which sold for more than $3,250 will be eliminated from our data analysis.
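Applied to a list of sale prices, the two cutoffs amount to a simple inclusive range filter. A minimal Python sketch, with made-up sample prices and hypothetical function names:

```python
MIN_PRICE = 1000  # minimum cutoff (inclusive), from the analysis above
MAX_PRICE = 3250  # maximum cutoff (inclusive), from the analysis above

def within_cutoffs(price, minimum=MIN_PRICE, maximum=MAX_PRICE):
    """True if a sale price survives both cutoffs."""
    return minimum <= price <= maximum

def filter_sales(prices, minimum=MIN_PRICE, maximum=MAX_PRICE):
    """Keep only the sales we treat as a single real pair."""
    return [p for p in prices if within_cutoffs(p, minimum, maximum)]

# Hypothetical sale prices, for illustration only.
sample = [450, 999, 1000, 1750, 3250, 3251, 5200]
print(filter_sales(sample))
```

Keeping the cutoffs as parameters makes it easy to raise the maximum later if prices climb.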

In the interest of time we’ve chosen not to document how we arrived at the maximum number, but rest assured it was calculated using a similar methodology and analyses as the minimum.  The purpose of setting and using a maximum is to eliminate auctions for multiple pairs (i.e., an auction for both pairs of Yeezy 2s).  If the price of the Yeezy increases dramatically in the future (and we have the ability to track data by month), then we might consider increasing the maximum.   But for now, we’ll use $3,250.

We have a minimum and a maximum, but if you’ve read this far you’re surely asking:  How is this related to trying to cop the Red Octobers?  Frankly, it just confirms what most of us already know:  If the deal seems too good to be true, if the price is too low, it’s almost definitely a fake.  The only way you’re gonna get a real pair of Yeezys for less than four figures is if you’re one of the very few lucky ones to cop it online from Foot Locker, or if you are Kevin Hart.

Nike-Air-Yeezy-2-Red-October-FWillen

Illustration by Felix Willen

We apologize if you thought Campless could impart some great wisdom or insight that would help you cop a pair.  Unfortunately, no amount of data analysis can change the extravagant imbalance between supply and demand which exists in the physical world.  Of course, all bets are off if our prediction comes true and Nike makes the Red October a general release to flood the market and kill the cachet of owning Yeezys as payback for the Kanye breakup.  Barring that, however, we hope you’ve enjoyed the small consolation prize – some tidbits of knowledge about outliers, data analysis and how we exclude fakes from our statistics.

Have a better way to deal with fakes?  Interested in more details about our process?  Ask in the comments.

In conclusion:  Up next is the fourth and final installment of the Yeezy 2 Analysis, in which we will re-calculate the Yeezy 2 data points (average price, volume, etc.) using the minimum and maximum, and compare them to the stats we calculated in Part 1 before excluding outliers.