Author Archives: Jason

About Jason

Jason R. Koepke is Founder and Data Strategist at GNT LLC, a risk-analysis and data strategy firm that provides analytical and technical services to the public and private sectors. His work and research have been featured in the academic, financial, and technical industries.

Wikipedia Headings as a Cross-Temporal Data Set

Many Wikipedia entries contain headings and sub-headings that have temporal connotations. Due to the Web 2.0, cross-temporal character of Wikipedia, these headings and sub-headings are ripe for data mining. The question I have is what we can learn from these types of datasets about how humans understand time.

Yeah, yeah, time is a social construct. So is everything. The point is to understand that social construct and how it is constructed. Wikipedia offers a great way of doing so, because the temporal markers are adjusted as extra-Wikipedia time marches on. In other words, even as we grow older at a “constant” rate, the temporal markers for a Wikipedia article adjust and re-adjust. For example, the Wikipedia article about Wiley has several headings, including:
* 1997-2003: Early Years
* 2004-2010: Solo Success
* 2011-present: Recent Work

It’s not difficult to imagine how these three breakdowns of the temporal landscape of his career have evolved since his debut (i.e., it is unlikely that these headings appeared as soon as he made it). So how do these headings evolve over time, and what events lead us to impose such periods?

I don’t have an answer, but I sure would like the answer.
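As a starting point for that kind of mining, here is a minimal Python sketch that parses headings of the form shown above into (start, end, label) periods and reports which periods are new between two snapshots of an article. The names `Period`, `parse_heading`, and `new_periods` are my own, and the "older" snapshot is invented purely for illustration; it is not taken from the article's actual revision history.

```python
import re
from typing import List, NamedTuple, Optional


class Period(NamedTuple):
    start: int
    end: Optional[int]  # None stands for an open-ended "present"
    label: str


# Matches headings such as "1997-2003: Early Years" or "2011-present: Recent Work".
# Accepts a hyphen or an en dash between the two years.
HEADING_RE = re.compile(
    r"^\s*(\d{4})\s*[-–]\s*(\d{4}|present)\s*:\s*(.+?)\s*$",
    re.IGNORECASE,
)


def parse_heading(heading: str) -> Optional[Period]:
    """Return a Period for a temporal heading, or None if it has no year range."""
    match = HEADING_RE.match(heading)
    if match is None:
        return None
    start, end, label = match.groups()
    end_year = None if end.lower() == "present" else int(end)
    return Period(int(start), end_year, label)


def new_periods(old: List[str], new: List[str]) -> List[Period]:
    """Periods present in the newer snapshot but not in the older one."""
    old_set = {p for p in map(parse_heading, old) if p is not None}
    return [p for p in map(parse_heading, new) if p is not None and p not in old_set]


# The current headings quoted in the post.
current = [
    "1997-2003: Early Years",
    "2004-2010: Solo Success",
    "2011-present: Recent Work",
]

# HYPOTHETICAL older snapshot, invented only to exercise the comparison.
older = [
    "1997-2003: Early Years",
    "2004-present: Solo Success",
]

for period in new_periods(older, current):
    print(period)
```

Run over real revision histories, a comparison like this would show when an open-ended "present" period gets closed off and a new one is imposed, which is exactly the moment I am curious about.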

Things I Learned This Week

Among the things I learned this week:
* Spotify doesn’t allow you to hide your existence/profile. (Courtesy: Spotify)

* North Korea had its first Ultimate Frisbee tournament. (Courtesy: North Korea Economy Watch)

* Google Alerts doesn’t work with filetype searches. (Courtesy: Google Alerts)

* The United States and Canada prohibit gay men from donating blood, and a host of other countries have behavioral restrictions on whether a gay man can donate blood. (Courtesy: NYT)

Breakdowns of Healthcare Data Breaches

On the side, I’m working on a data-breach project with a specific focus on IT-related problems. A major part of any analysis I do is separating truth from hype. And hype is a major part of the reporting done on data breaches, particularly when the breaches are due to hacking or security issues. Using HHS data on healthcare-related breaches, the hype (OMG, when hackers strike!) is quickly separated from reality (how frequent breaches are, and what percentage of them are hacking/IT-security related). I use this as a starting point, because it allows the various parties to understand the issue at hand and develop a properly balanced risk-mitigation strategy (i.e., spend money where it counts).

At first glance, it is striking how steady the rate of healthcare data breaches is (see below). One might have expected breaches to be increasing, perhaps non-linearly, given how much reporting there is and how we magnify reports with our fears.



Furthermore, this consistency is found in cyber/IT/hacking-related breaches, an area where we might suspect an increasing number of incidents due to the rebirth of major hacking groups, cloud computing, and increasingly shared medical records. As the graphic below makes clear, the feared increase in cyber-related breaches has not materialized.

Last, if we look at the breaches by type (and, here, it is useful to have the data in front of you and manage the data to make proper analysis more insightful), we learn that improving record-handling procedures would lead to significant improvements, as it’s a combination of lost records and poorly managed access to said records that leads to the most breaches. A more helpful analysis, which I’ve done, shows which types of attack lead to the most records being breached (this is important for notification purposes and cost calculations, but not for preventing breaches from occurring), etc.
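For readers who want to try a by-type breakdown themselves, here is a minimal Python sketch. It is not how I produced the graphics, and the rows in `SAMPLE_ROWS` are placeholder values invented for illustration, not HHS figures. The category names and the field names (`year`, `type`, `records`) are assumptions too; check them against the actual HHS export before relying on them.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# PLACEHOLDER rows, invented for illustration only. These are NOT HHS data.
# Field names and breach-type labels are assumptions about the export format.
SAMPLE_ROWS: List[dict] = [
    {"year": 2010, "type": "Theft", "records": 500},
    {"year": 2010, "type": "Loss", "records": 300},
    {"year": 2011, "type": "Hacking/IT Incident", "records": 10000},
    {"year": 2011, "type": "Theft", "records": 700},
]


def summarize_by_type(rows: List[dict]) -> Dict[str, dict]:
    """Per breach type: number of breaches, share of all breaches, records affected."""
    counts = Counter(row["type"] for row in rows)
    records = defaultdict(int)
    for row in rows:
        records[row["type"]] += row["records"]
    total = sum(counts.values())
    return {
        breach_type: {
            "breaches": n,
            "share": n / total,
            "records": records[breach_type],
        }
        for breach_type, n in counts.items()
    }


summary = summarize_by_type(SAMPLE_ROWS)

# Sort by breach count to see which types occur most often.
for breach_type, stats in sorted(summary.items(), key=lambda kv: -kv[1]["breaches"]):
    print(f"{breach_type}: {stats['breaches']} breaches "
          f"({stats['share']:.0%}), {stats['records']} records")
```

Keeping breach counts and records affected side by side matters because the two can rank the breach types very differently: the first speaks to prevention, the second to notification and cost.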

Deeper analysis of this data allows companies to properly insure against data breaches, allocate rational/reasonable resources to mitigate the different types of breaches, and evolve their data handling and breach response policies over time.

A couple of notes:
* Data is courtesy of HHS. This does not include unreported breaches. Due to state-reporting requirements, reporting may be biased or of varying quality. The data could be normalized/adjusted/tweaked to provide a clearer picture, but the untouched data is useful in a few ways.

* Graphics and data handling are courtesy of Palantir Government software.

* If you or your company would like more detailed analysis, with this or other similar data, reach out to me (jasonkoepke a gmail . com).