OKCupid Data Leak – Framing the Debate

You’ve probably heard by now that a ‘researcher’ by the name of Emil Kirkegaard released the sensitive data of 70,000 individuals from OKCupid on the Open Science Framework. This is an egregious violation of research ethics, and we’re already beginning to see mainstream media coverage of this unfolding story. I’ve been following this pretty closely as it involves my PhD alma mater, Aarhus University. All I want to do here is collect relevant links and facts for those who may not be aware of the story. This debacle is likely to become a key discussion piece in future debates over how to conduct open science. Jump to the bottom of this post for a live-updated collection of news coverage, blogs, and tweets as this issue unfolds.

Emil himself continues to fan the flames by being totally unapologetic:

An open letter has been posted here, currently signed by over 150 individuals (myself included), petitioning Aarhus University for a full statement and investigation of the issue:

https://docs.google.com/document/d/1xjSi8gFT8B2jw-O8jhXykfSusggheBl-s3ud2YBca3E/edit

Meanwhile Aarhus University has stated that Emil acted without oversight or any affiliation with AU, and that if he has claimed otherwise they intend to take (presumably legal) action:

 

I’m sure a lot more is going to be written as this story unfolds; the implications for open science are potentially huge. Already we’re seeing scientists wonder if this portends previously unappreciated risks of sharing data:

I just want to try and frame a few things. In the initial dust-up of this story there was a lot of confusion. I saw multiple accounts describing Emil as a “PI” (principal investigator), asking for his funding to be withdrawn, and so on. At the time the details surrounding this were rather unclear. Now, as more and more details emerge, a rather different picture is forming, one that is not being accurately portrayed so far in the media coverage:

Emil is not a ‘researcher’. He acted without any supervision or direct affiliation with AU. He is a master’s student who claims on his website that he is ‘only enrolled at AU to collect SU [government funds]’. Most of the outlets are describing this as ‘researchers release OKCupid data’. When considering the implications of this for open science and data sharing, we need to frame it as what it is: a group of hacktivists exploiting a security vulnerability under the guise of open science, NOT a university-backed research program.

What implications does this have for open science? From my perspective it looks like we need to discuss the role of oversight and data protection. Ongoing Twitter discussion suggests Emil violated EU data protection laws and the OKCupid terms of service. But other sources argue that this kind of scraping ‘attack’ is basically data-gathering 101, and that nearly any undergraduate with the right education could have done this. It seems like we need to have a conversation about our digital rights to data privacy, and whether those rights are doing enough to protect us. Doesn’t OKCupid itself hold some responsibility for allowing this data to be accessed so easily? And what is the responsibility of the Open Science Framework? Do we need to put stronger safeguards in place? Could an organization like Anonymous, or even ISIS, ‘dox’ thousands of people and host the data there? These are extreme scenarios, but I think we need to frame them now, before people walk away with the idea that this is an indictment of data sharing in general.

Below is a collection of tweets, blogs, and news coverage of the incident:


Tweets:

Brian Nosek on the Open Science Framework’s response:

More tweets on larger issues:

 

Emil has stated he is not acting on behalf of AU:


 

News coverage:

Vox:

http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release?utm_campaign=vox&utm_content=chorus&utm_medium=social&utm_source=twitter

Motherboard:

http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published

ZDNet:

http://www.zdnet.com/article/okcupid-user-accounts-released-for-the-titillation-of-the-internet/

Forbes:

http://www.forbes.com/sites/emmawoollacott/2016/05/13/intimate-data-of-70000-okcupid-users-released/#2533c34c19bd

The Mary Sue:

http://www.themarysue.com/okcupid-profile-leak/

Here is a great example of how bad this is; Wired ran a story with the headline ‘OkCupid Study Reveals the Perils of Big-Data Science’:

OkCupid Study Reveals the Perils of Big-Data Science

This is not a study! It is not ‘science’! At least not by any principled definition!


Blogs:

https://ironholds.org/blog/when-science-goes-bad-consent-data-and-doubling-down-on-the-internet/

https://sakaluk.wordpress.com/2016/05/12/10-on-the-osfokcupid-data-dump-a-batman-analogy/

http://emilygorcenski.com/blog/when-open-science-isn-t-the-okcupid-data-breach

Here is a defense of Emil’s actions:
https://artir.wordpress.com/2016/05/13/in-defense-of-emil-kirkegaard/

 

How to reply to #icanhazpdf in 3 seconds

Yesterday my friend Hauke and I theorized about a kind of dream scenario: a totally distributed, easy-to-use publication liberation system. That is perhaps not feasible at this point [1]. Today we’re going to present something that will be useful right now. The essential goal here is to make it so that anyone, anywhere, can access the papers they need in a timely manner. The idea is to take advantage of existing strategies and tools to streamline paper sharing as much as possible. Folks already do this: every day, on Twitter or in private, requests for papers are made and fulfilled. Our goal is to completely streamline this process down to a few clicks of your mouse, so that a small but dedicated group of folks, the Papester Collective, can ensure that #icanhazpdf requests are fulfilled almost instantly. This is a work in progress. Leave comments on how to improve and further streamline this system, and join the collective!

SHORT VERSION: HOW TO GET A PAPER BEHIND A PAYWALL QUICKLY

Tweet (for example): “#icanhazpdf http://dx.doi.org/10.1523/JNEUROSCI.4568-12.2013”

Click here for more detailed instructions.

HOW TO JOIN THE COLLECTIVE AND START SERVING REQUESTS

SHORT INSTRUCTIONS AND REQUIRED SOFTWARE:

  1. Twitter: monitor #icanhazpdf requests.
  2. Zotero and the Zotero browser plugin: after clicking a DOI link or abstract page, just click the ‘Save to Zotero’ button to auto-grab the PDF.
  3. ZotFile: automatically copies new Zotero PDFs to a public Dropbox folder.
  4. Dropbox: cloud storage to seamlessly share files with anyone, no login required.
  5. Dropbox Linker: automatically adds links to files in your public folder to your clipboard.
  6. Reply to request tweets: paste the URL from your clipboard and, if you like, add #papester.

That’s it! Now you can just click a request link, click the ‘Save to Zotero’ button, and CTRL+V a Dropbox direct-download link in response! (A rough sketch of how the monitoring side could be scripted is below.)
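For anyone who wants to go a step beyond the point-and-click workflow, here is a minimal, hypothetical sketch of the request-monitoring side in Python. It assumes Tweepy’s v2 client with Twitter API credentials you supply yourself, and the lookup_dropbox_link function is a placeholder you would implement over your own folder of already-shared PDFs. Treat it as an illustration of the pipeline, not a finished tool.

```python
# Hypothetical sketch: watch for #icanhazpdf requests and reply with a
# Dropbox link you already have on hand. Assumes tweepy (v2 client) and
# your own API credentials; lookup_dropbox_link is a placeholder.
import re
import time
from typing import Optional

import tweepy

DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")  # rough DOI matcher


def lookup_dropbox_link(doi: str) -> Optional[str]:
    """Placeholder: return a public Dropbox URL for this DOI if you have the PDF."""
    # e.g. look the DOI up in a local dict or a folder you maintain
    return None


def main() -> None:
    client = tweepy.Client(
        bearer_token="YOUR_BEARER_TOKEN",       # read access
        consumer_key="YOUR_KEY",                # write access for replies
        consumer_secret="YOUR_SECRET",
        access_token="YOUR_ACCESS_TOKEN",
        access_token_secret="YOUR_ACCESS_SECRET",
    )
    seen = set()  # tweet IDs we have already handled
    while True:
        resp = client.search_recent_tweets(
            query="#icanhazpdf -is:retweet", max_results=10
        )
        for tweet in resp.data or []:
            if tweet.id in seen:
                continue
            seen.add(tweet.id)
            match = DOI_PATTERN.search(tweet.text)
            if not match:
                continue
            link = lookup_dropbox_link(match.group(0))
            if link:
                client.create_tweet(
                    text=f"{link} #papester", in_reply_to_tweet_id=tweet.id
                )
        time.sleep(60)  # stay well within rate limits


if __name__ == "__main__":
    main()
```

One caveat: links in tweets are wrapped by Twitter’s URL shortener, so in practice you would need to resolve the shortened URL before matching the DOI; the sketch glosses over that.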

Click here for more detailed instructions.

1. The fundamental problem: uploading huge repositories of scientific papers is not sensible for now. It is simply too much data (roughly 50 million papers at 0.5-1.5 megabytes each comes to about 25-75 terabytes), and demand is spread much more uniformly across papers than across traditionally shared files like music. For comparison, there are on the order of 100 million songs at roughly 3.5 MB each, and it is still difficult to find exotic songs online; availability is decent only because most demand concentrates on a few favourites, which is not the case for papers. Also, fewer people will share papers than songs, which makes it even more difficult to sustain a complete repository. Thus, we need a system that fulfills requests individually.
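For concreteness, here is the same back-of-envelope arithmetic written out; the counts and file sizes are the rough figures quoted above, not measured values.

```python
# Back-of-envelope storage estimates from the figures above.
papers = 50_000_000                    # rough number of scholarly papers
paper_size_mb = (0.5, 1.5)             # assumed size range per paper, in MB

low_tb = papers * paper_size_mb[0] / 1_000_000   # MB -> TB
high_tb = papers * paper_size_mb[1] / 1_000_000
print(f"Papers: ~{low_tb:.0f}-{high_tb:.0f} TB")  # ~25-75 TB

songs = 100_000_000                    # rough number of songs
song_size_mb = 3.5
print(f"Songs: ~{songs * song_size_mb / 1_000_000:.0f} TB")  # ~350 TB
```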

Disclaimer: please make sure you only share papers with friends who also hold the copyright to the papers you share.