I happened upon some Pew online dating survey data that seemed worth exploring, so I explored it, and it was! But before I get into what I discovered, I should say a little about the data’s structure, and the questions my exploration centered around. The data’s 167 columns represent survey questions; its 2,252 rows represent American adults interviewed in 2013. The data comes with a separate Word document, the script used by interviewers, which describes the survey questions in detail. As I read this script, certain questions jumped out at me. I was particularly interested in the who, where, and why of online dating, in addition to whether there’s any stigma associated with it. One more point worth making: Occasionally, respondents either refused to answer certain questions, or said they didn’t know the answer. My plots include only the people who could, and did, answer the question at hand. Ok … Let’s dive into the data!
I knew, to answer any “who” questions, I needed to be able to group respondents into demographics. So, my first order of business was to explore the distributions of certain demographic variables. I started with age:
Here I see a bimodal distribution, with Baby Boomers on the right, and their children on the left. The left peak is shorter than the right, so it looks like the Boomers didn’t replace themselves. Also, the distance between the peaks tells me roughly how long the Boomers waited to have children.
Next, I looked at gender, race, and sexual orientation:
I wasn’t expecting whites or straights to be such large majorities, but the population surveyed was, demographically, highly representative of the U.S. population - I googled it! Being wary of small sample sizes, I consolidated minorities to create the race and sexual orientation variables plotted above.
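Consolidating small categories like this takes only a few lines of Python. What follows is a minimal sketch: the `consolidate` helper and the category labels are hypothetical stand-ins, not the survey's actual codes.

```python
from collections import Counter

def consolidate(values, keep, other_label):
    """Map every value not in `keep` to `other_label`."""
    return [v if v in keep else other_label for v in values]

# Hypothetical responses for illustration; not the actual Pew coding.
race = ["White", "Black", "White", "Asian", "Mixed", "White", "Black"]
race_binary = consolidate(race, keep={"White"}, other_label="Non-white")

# Count how many respondents land in each consolidated bucket.
print(Counter(race_binary))
```

The same helper would work for sexual orientation, with a different `keep` set and label.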
Then I looked at education:
… and annual household income ($1,000’s):
The education distribution looks irregular, while the income distribution looks closer to uniform. My unsurprising takeaway? Education does not guarantee income. The size of the lower income and poor segments of the population is striking though.
Lastly, I checked out relationship status, and whether someone had tried online dating or not:
So, I created a variable for the left plot by putting respondents who were either married, living with a partner, or in some other kind of committed romantic relationship in one bucket, and respondents who were not in a committed romantic relationship in a second bucket. I was surprised to find that only about 11% of internet users had tried online dating. I was curious how this percentage might change looking at different subsets of the internet-using population, so I dug deeper, starting with gender:
I found out that males were slightly more likely than females to have tried online dating, although female online daters ever so slightly outnumbered their male counterparts, simply because there were more women than men among the respondents asked.
Then I looked at age:
The people who had not tried online dating tended to be older than those who had. A wide range of ages were present in both groups, but the “No” group was more spread out. I was expecting the median age in the “Yes” group to be way less than 41. Were Boomers forcing up the median?
I wanted to know, more precisely, what age respondents were most likely to have tried online dating, so I plotted some frequency polygons …
… and found that it was most common in the 30 to 40-year-old range.
Then I started wondering about relationship status and age, which led to more frequency polygons:
What I see here - other than the fact that we’re born alone, and die alone - is people settling down in their late 20’s nowadays. Are 30 to 40-year-olds turning to online dating because they feel pressured to settle down? Most people their age are committed. Do they find meeting singles offline harder than it was in their early 20’s?
Food for thought. Let’s check out race:
So, whites were slightly more likely to have tried online dating than non-whites. I did find a spike within the minority group: 16% of mixed race respondents said they had tried online dating.
Then I plotted faceted histograms to look at income and education, but the plots didn’t feel right. The two variables seemed more quantitative to me, and it was hard to grok how the rate of online dating changed with increasing education or income, looking at 8 or 9 histograms. Plus, I suspected a dependence between the two variables, so I wanted a plot that would showcase them together. In the end, I decided to just plot the percentage of people who had tried online dating across income/education groups:
The two curves sort of follow each other probably because education level tends to dictate income. But, as we saw in the one-dimensional plots, education doesn’t guarantee income, so it makes sense that the curves don’t line up perfectly. Interestingly, income seems to be a better predictor of whether you’ve tried online dating or not.
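The percentages behind a plot like this boil down to a grouped rate. Here is a rough sketch, assuming each respondent is a dictionary; the `percent_yes_by_group` helper, the key names, and the rows are all made up for illustration. Respondents with a missing answer are skipped, in line with how refusals and "don't know" answers were left out of the plots.

```python
from collections import defaultdict

def percent_yes_by_group(rows, group_key, target_key):
    """Return {group: percent of rows whose target is True}, skipping missing answers."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group, answer = row.get(group_key), row.get(target_key)
        if group is None or answer is None:
            continue  # refused or didn't know
        total[group] += 1
        yes[group] += bool(answer)
    return {g: 100.0 * yes[g] / total[g] for g in total}

# Made-up rows for illustration only; not survey data.
rows = [
    {"income": "<40k", "tried": False},
    {"income": "<40k", "tried": True},
    {"income": "40-75k", "tried": True},
    {"income": "40-75k", "tried": True},
    {"income": "40-75k", "tried": False},
    {"income": "75k+", "tried": False},
    {"income": "75k+", "tried": None},
]
rates = percent_yes_by_group(rows, "income", "tried")
print(rates)
```

Calling the helper once with the income key and once with an education key would give the two curves to overlay.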
Anyway, I see that respondents with an annual household income between 40 and 75 thousand dollars were most likely to have tried online dating. But why? How could I explain the plot’s mysterious peaks and valleys? My first impulse was to look at age and income:
As we saw looking at online dating and age, the median age across groups is high, again, probably due to the overrepresentation of Baby Boomers. Also, the higher income groups skew older, and, in terms of age, we saw that young adults were most likely to have tried online dating. Could the low levels of online dating in higher income groups have to do with the age of those groups?
Maybe there were fewer singles in the higher income groups too. But before I looked into that, I wanted to simply know if singles were more likely to have tried online dating …
… and, indeed, they were. I’m guessing online dating wasn’t really a thing back when some of the committed respondents were “playing the field.” It’s worth noting that 28% of divorced respondents, a subset of the uncommitted population, said they had tried online dating.
Ok. Now let’s look at the percentage of people who were in some kind of committed romantic relationship across income groups:
Wow! The prevalence of committed respondents increased almost linearly across income groups. I guess that makes sense, thinking about joint incomes: respondents were probably more likely to have had higher household incomes if they were in committed relationships. And maybe there’s some truth to the claim that the number one thing couples fight about is money. Are poorer couples more likely to split up? Or maybe the rich are just more sought-after. Regardless, the fact that high income groups tended to be older and more romantically unavailable made their inexperience with online dating seem reasonable. But wait a second. If middle income respondents tended to be older and more romantically unavailable than low income respondents, why weren’t the latter more likely to have tried online dating? My guess is that online dating, and 40 to 75 thousand dollar annual household incomes, are more common in big cities.
Ok. As much as I’d love to fall deeper down the rabbit hole that is the education/income plot, I have my sanity to consider. Let’s move on to online dating and sexual orientation:
Online dating was significantly more common among LGBT respondents than among straight respondents. But I was curious whether the same tendencies I saw in the general population would hold true in the LGBT group. So I took a look-see, again, starting with gender:
Interesting. In the general population, men were more likely to have tried online dating, but in the LGBT group, women were. Why? Bisexuals, the most populous subset of the LGBTs, were, surprisingly, 74% female, and 29% of female bisexual respondents had tried online dating. Among gay respondents, men were again more likely to have tried it.
Ok. On to age:
So, we see some of the same things we saw in the general population plot, namely that the “No” group tended to be older than the “Yes” group, and that the “No” group was more spread out. But what makes this plot stand out is how low the median ages are. Were older closeted respondents not identifying as LGBT?
About race:
Again, online dating was more common among whites than non-whites, but the frequency differences between the two groups were a little more pronounced here than they were in the general population. Interestingly, among gay respondents, non-whites were more likely to have tried online dating. Is it harder for non-whites to come out as gay, and if so, did that translate to fewer offline dating opportunities for these respondents? In general, I’m guessing it’s harder for people in the LGBT community to find people to date because they’re such a dispersed minority: concentrated LGBT communities seem rare offline.
Let’s look again at the percent of online daters in each income/education group:
What jumps out at me here is how the curves line up. LGBT respondents with a given level of education tended to get paid less than equally-educated respondents in the general population. Wage gap aside, we see the same sort of W-shaped curve we saw in the general population plot, but here we see a global maximum in the low income range.
What about relationship status?
Nothing surprising here. Online dating was more common among uncommitted respondents, by about the same amount we saw in the general population (roughly 10 percentage points). But then I wondered how common committed romantic relationships were in the LGBT community, so I looked at relationship status and sexual orientation:
So, LGBTs were more likely to be uncommitted, which may, at least partially, explain their higher use of online dating sites.
Ok. If you’re all who’d out at this point, that makes two of us.
Let’s look at which sites online daters had tried, and the gender ratios of those sites:
I see two major players here: 75% of respondents who had tried online dating used either Match.com, OkCupid (a subsidiary of Match.com), or eHarmony. Six of the nine sites mentioned by respondents had been tried mostly by women. Of the top two sites, eHarmony shows a significantly higher female-to-male ratio.
Now let’s look at how strongly respondents related to five reasons for online dating:
I once read, in an OkTrends post, how women tend to prefer men of their own race. People look for themselves in others. We see that here in the reasons respondents most related to: meeting people who share similar interests or hobbies, and meeting people who share the same beliefs or values. Surprisingly, interests trumped values. I would have expected the latter to be more important to someone in the market for a long-term partner. Aren’t interests more prone to changing over time? But hold on. Only 43% of respondents cited a long-term relationship as a major reason they used online dating, so maybe it’s not that surprising. But then why did so many respondents turn to sites geared toward long-term relationships, like eHarmony or Match.com? Maybe some were conflicted. The plot does imply that some respondents cited both a long-term, and a casual, relationship as reasons. Actually, 55% did.
Now let’s look at all the respondents again, not just the online daters, and let’s see whether they agreed or disagreed with four statements about online dating:
So, we’re pretty sure online dating isn’t a bad thing, but we’re more split on whether it’s a good thing. Is this at least partially why so many people hadn’t tried it? What about the fact that only 18% of uncommitted respondents said they were looking for a romantic partner? Ok, but why hadn’t more committed respondents tried online dating? Surely it had been an option for more than just the 6% of them who had tried it. Were they unsure about online dating because they needed more social proof?
Let’s look at how respondents fared when asked if they knew anyone who had: 1. tried online dating, and 2. found a committed romantic partner online dating:
On the left, we see that most respondents didn’t know any online daters. But 11% of internet-using respondents had tried online dating, and most Americans live in urban areas where online dating is probably even more common. Wouldn’t that make the presence of at least one online dater in most social circles pretty likely? Were people keeping their online exploits to themselves? On the right, we see that significantly fewer respondents knew someone who had found a committed romantic partner online dating. Could this explain why more of the respondents didn’t agree with the positive statements about online dating, and, hence, why more of them hadn’t tried it? Or maybe they’d been swayed by online dating horror stories …
Let’s look at how online daters responded, by gender, when asked if they’d ever been contacted through a dating site in a way that made them feel harassed or uncomfortable:
So, most online daters said they’d never had such an experience, but women said “yes” significantly more often.
Now let’s look at whether online daters, again by gender, had ever felt like someone seriously misrepresented themselves in their profile:
This seemed to be a more common experience for both sexes, but, again, women tended to say “yes” more often. Are we seeing a downside of anonymity here? Hmmm …
So, yes, most online daters had had negative experiences, but maybe most were also able to eventually find what they were looking for. Let’s try and get a better idea of how useful online dating is, in terms of helping people find committed romantic partners. To this end, I created a new variable by putting all the respondents in a committed romantic relationship, who had met their partners online dating, in one bucket, and those who had met their partners some other way, in a second bucket. I wanted to sort of reproduce a study done in 2010, by Match.com, which said: “17% of couples married in the last 3 years met each other on an online dating site.” So I did. I also looked at how many respondents in a committed romantic relationship, who had tried online dating, met their partner doing so:
On the left, we see that significantly less than 17% of respondents married between 2007 and 2010 met their spouses online dating, and I wonder if this discrepancy has to do with breakups, which, according to a recent study done by Aditi Paul, at Michigan State University, are more common among couples who met online. On the right, we see that most online daters who ended up in a committed romantic relationship didn’t find their partners online dating. Splitting these respondents up based on where they’d online dated, I saw that even the sites with the highest success rates - Match.com (36%) and “Other” sites (44%) - weren’t that effective. So, were negative experiences biasing respondents, or forcing them to hang up their online dating hats altogether, and pursue partners elsewhere? Or maybe there’s a sort of adverse selection process at play. Maybe online dating sites tend to attract the less relationship-savvy, and that contributed to the higher offline success rate. Just spitballing here.
Ok. So, we previously looked at how likely it was for respondents to have tried online dating, given their marital status. Now let’s look at the reverse. Let’s look at how likely respondents were to be committed, based on whether they’d tried online dating:
So, the percentages we see here can also be calculated using Bayes’ Rule: all the numbers we need, we’ve seen, but I digress. This plot says that respondents who had tried online dating were significantly more likely to not be in a committed romantic relationship. I guess this makes sense, remembering that only 43% of online daters said they were serious about finding a long-term partner. But I think psychologist Barry Schwartz might have hit the nail on the head, in his book, “The Paradox of Choice,” where he makes the following pertinent points: 1. Choice causes paralysis: obviously, the more choices we have, the harder it is to make a decision. 2. If we do make a decision, we end up less satisfied: with more options, it’s easier to imagine having made a better choice. 3. The more choice we have, the higher our expectations become, which leads to less satisfaction, even when the result is good.
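For the curious, the Bayes' Rule calculation mentioned above can be sketched as follows. The 6% and 11% figures come from earlier in this post (and were measured on slightly different groups, so treat the result as approximate); the share of committed respondents is a placeholder I am assuming for illustration, not a number from the survey.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("p_b must be in (0, 1]")
    return p_b_given_a * p_a / p_b

p_tried_given_committed = 0.06  # share of committed respondents who had tried it
p_tried = 0.11                  # share of internet users who had tried it
p_committed = 0.60              # ASSUMPTION: placeholder value, not a survey figure

# P(committed | tried online dating)
p_committed_given_tried = posterior(p_tried_given_committed, p_committed, p_tried)
print(round(p_committed_given_tried, 3))
```

Swapping in the real share of committed respondents should reproduce the percentage shown in the plot, up to rounding.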
I couldn’t help but try and build a classifier that could predict if someone was an online dater or not. A few words on my attempt:
So, before I did any machine learning, I considered how I might mitigate my data scarcity problem. A breakthrough came when I discovered that Pew researchers, to track responses over time, had compiled two datasets using essentially the same questionnaire: the 2013 data I had explored, and a second dataset, created in 2005, that I had not yet touched. Merging the two datasets introduced an additional 46 online dater examples. I also discovered 25 respondents in the 2013 data who used online dating apps exclusively, and added them to the online dater group. Note that the exploration chronicled above focused on the use of dating sites only. Anyway, although I succeeded in extracting additional online daters, non-online daters remained heavily over-represented. Inspecting the performance of early classifiers made it clear that this was a handicap: they had high accuracy scores, but zero recall and precision for the online dater class. These classifiers were trained on folds of all examples of both classes, but by limiting the number of non-online daters in training sets, I was able to produce models that could classify both classes correctly most of the time.
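To see why high accuracy can coexist with zero recall and precision, consider a model that always predicts "non-online dater." The sketch below uses made-up labels at roughly a 1-in-10 ratio, and its `undersample` helper balances the classes 1:1. That ratio is my assumption for illustration; the actual limit on non-online daters in the training sets may have been different.

```python
import random

def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)

def undersample(labels, majority, seed=0):
    """Indices of all minority rows plus an equal-sized random sample of majority rows."""
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == majority]
    minor = [i for i, y in enumerate(labels) if y != majority]
    kept = rng.sample(major, min(len(major), len(minor)))
    return sorted(minor + kept)

# Made-up labels: 1 = online dater, 0 = non-online dater.
y_true = [1] * 10 + [0] * 90
always_no = [0] * len(y_true)  # a "classifier" that never predicts an online dater

precision, recall, accuracy = precision_recall_accuracy(y_true, always_no)
print(precision, recall, accuracy)

balanced_idx = undersample(y_true, majority=0)
print(len(balanced_idx))
```

Undersampling should be applied to training folds only, so that test folds still reflect the real class mix.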
What did I learn building these models? Most importantly, I found new answers to the question of who had tried online dating. How? Obviously, I had to choose a set of features that didn’t track too closely to the target, which could ultimately be used by a classifier to make predictions. I ended up choosing features related to gender, age, race, income, education, sexual orientation, relationship status, internet use, location, parental status, and social proof. I was able to gain insights into which of these features most differentiated online daters from non-online daters using univariate feature selection (SelectKBest). What I found was that questions having to do with marital status and whether respondents knew anyone who had tried online dating had the most predictive power.
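Univariate feature selection scores each feature against the target on its own, then keeps the top k. The sketch below is a hand-rolled stand-in, not scikit-learn's implementation: it uses a plain Pearson chi-square statistic on yes/no features, which is my choice of scoring function for illustration, and the feature names and data are made up.

```python
def chi_square_2x2(feature, target):
    """Pearson chi-square statistic for two binary (0/1) sequences."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, t in zip(feature, target):
        observed[f][t] += 1
    row = [sum(observed[0]), sum(observed[1])]
    col = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def top_k(features, target, k):
    """Names of the k features with the highest chi-square score."""
    scores = {name: chi_square_2x2(values, target) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up data: 1 = yes, 0 = no. The target is "has tried online dating".
target = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "married":       [0, 0, 0, 1, 1, 1, 1, 0],
    "knows_a_dater": [1, 1, 0, 0, 0, 0, 1, 0],
    "owns_a_pet":    [1, 0, 1, 0, 1, 0, 1, 0],
}
best = top_k(features, target, k=2)
print(best)
```

A feature that is independent of the target scores near zero, while one that splits the two classes cleanly scores high.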
Here’s a simple decision tree (F1 = 0.75, Accuracy = 0.72) illustrating the two highest-scoring questions at work:
Actually, this is more of a decision tree-slash-heat map, with darker colors signifying higher non-online dater concentrations. Note that if a respondent was married, odds are they hadn’t tried online dating, especially if they didn’t know any online daters; if a respondent wasn’t married, they probably had tried online dating, particularly if they knew someone who had. So, social proof seems like an important factor in people’s decision to try online dating, and will probably continue to be. But the question of whether someone is married may lose power over time, as married couples who met before online dating was an option become more and more rare. Then again, it may not: people geared toward sustainable marriages may simply have little need for online dating. The question of whether this model, or any of the Pew data’s implications, will hold up over time is crucial. Merging the 2005 and 2013 datasets relied on the assumption that online daters and non-online daters have fixed attributes. Whether and how the makeups of these groups change over time would be worth exploring, but, alas, I must move on.
Thank you for reading this far.