Thursday, December 4, 2014

Making the City Safer - One Big Data at a Time...

“The question I had when I came in was, Do we sit on our hands waiting for crime to tick up, or can we do something to drive crime lower?” (NYTIMES, 2014) With limited resources and a never-ending caseload, Manhattan District Attorney Cyrus Vance Jr. turned to data to improve the criminal justice process in NYC. He used a version of the Pareto principle (the 80-20 rule) to find out which smaller subset of the bad guys was causing most of the crimes.
In speeches praising intelligence-driven prosecution, Vance often cites the case of a 270-pound scam artist named Naim Jabbar, who for more than a decade made a living in the Times Square area bumping into pedestrians and then demanding money, saying they had broken his glasses. Convicted 19 times on the misdemeanor charge of “fraudulent accosting,” Jabbar never served more than five months in jail until he was flagged by the C.S.U. [the D.A.'s Crime Strategies Unit]. His next arrest, in July 2010, triggered an alert. Instead of being offered a plea bargain, he was indicted and subsequently convicted on a felony robbery charge, and sentenced to three and a half to seven years in prison. With time served before his conviction, he was soon paroled and then arrested again, in July 2014, for another broken-eyeglasses incident and charged with robbery and grand larceny. (NYTIMES, 2014)
This is the simple genius of data-driven decision making - learning to do more with less. If there are 10 bad guys, but 1 of them is a really bad guy - go after him first. After enough time, the repeat offenders are gone, locked up - and your resources are freed up a bit to deal with the real issues. Word spreads that repeat offenders are more likely to face real jail time, and that acts as a disincentive to bad behavior. As Karen Friedman Agnifilo states in the article, "There’s a reason murders in Manhattan went from 70 in 2010 to 29 so far this year. We figured out who are the people driving crime in Manhattan, and for four years we focused on taking them out."
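
In data terms, the approach is little more than a weighted tally. Here's a minimal sketch of the 80-20 idea in Python, using invented arrest records - the names, fields, and threshold are mine for illustration, not the D.A.'s actual system:

    from collections import Counter

    # Hypothetical arrest records (invented data, not real case files).
    arrests = [
        {"offender": "A", "charge": "fraudulent accosting"},
        {"offender": "A", "charge": "petit larceny"},
        {"offender": "A", "charge": "fraudulent accosting"},
        {"offender": "B", "charge": "trespass"},
        {"offender": "C", "charge": "petit larceny"},
    ]

    # Count arrests per offender, then find the smallest group of repeat
    # offenders who account for 80% of the incidents.
    counts = Counter(a["offender"] for a in arrests)
    target = 0.8 * len(arrests)
    running, priority = 0, []
    for offender, n in counts.most_common():
        priority.append(offender)
        running += n
        if running >= target:
            break

    print(priority)  # the short list to focus resources on

Run against real records, the same loop surfaces the short list of chronic offenders worth a prosecutor's extra attention.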

Big Data 1, Bad Guys 0!

Thursday, November 13, 2014

Resistance is Futile... (and frequent)

Stonehill College is working to develop the institution's next strategic plan. As part of that process, several groups were formed to delve into specific areas of the plan. Recently, I've been participating in the Student Experience Group - the group looking at engagement on-campus, engagement off-campus, keeping students connected to the community, student retention, and the transformative nature of a Stonehill education. The group is considering which software might be effective in these efforts. How do we plan a student's experience and make sure they are exposed to opportunities like internships, community service, and study abroad? How do we communicate as a community when we notice a student in trouble, either academically or physically? How do we quickly identify students who are academically at risk?

Naturally, one of the solutions we're studying is Blackboard Analytics for Learn. Stonehill recently converted to Blackboard 9 since our previous CMS was end-of-lifed. Included in the contract was Blackboard Analytics for Learn, a powerful analytics reporting tool that offers insight into class behavior. For instance - how often are students logging in? How does the behavior of 'A' students compare with that of 'B' students overall? One of the exciting aspects of this reporting is that, with enough participation by faculty, you could do away with the old mid-term grade process. If faculty are using Blackboard to track quizzes, tests, participation, etc. and using it for grading, Blackboard Analytics would let you see which students are in trouble. Not only that, analytics lets you see trends - which students are struggling in a single class vs. which are struggling in all of their classes? Since the analysis is based on real-time data, you could see a student who suddenly stops participating 2/3 of the way into a semester and direct resources to them. The mid-semester grade process is point-in-time: what are a student's grades on October 15, for example. But what happens if the student has a personal crisis on October 18? Analytics gives you the flexibility to spot the trouble and try to remedy the situation.
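
To make the idea concrete, here's a minimal sketch of that kind of trend check. The weekly login counts, the half-the-baseline threshold, and the data layout are all invented for illustration - this is not Blackboard Analytics' actual schema or API:

    # Flag students whose recent activity drops well below their own baseline.
    baseline_weeks = 8  # weeks used to establish "normal" for each student

    weekly_logins = {
        "student_1": [12, 10, 11, 13, 9, 12, 11, 10, 12, 11],  # steady
        "student_2": [14, 12, 15, 13, 14, 12, 13, 14, 3, 1],   # sudden drop
    }

    for student, logins in weekly_logins.items():
        baseline = sum(logins[:baseline_weeks]) / baseline_weeks
        recent = sum(logins[baseline_weeks:]) / len(logins[baseline_weeks:])
        # Flag anyone whose recent activity falls below half their baseline.
        if recent < 0.5 * baseline:
            print(f"{student}: baseline {baseline:.1f}/wk, now {recent:.1f}/wk - check in")

Because each student is compared against their own history, the quiet student who was always quiet isn't flagged, but the active student who suddenly goes silent on October 18 is.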

However, there is some unexpected resistance. People, particularly faculty, are concerned that 'Big Brother' is watching. If you can use the tool to see student participation, it's not difficult to imagine looking to see which faculty are behind on grading, or don't give students adequate feedback. While we've always used the reporting tools we have responsibly and ethically, I can understand their concern. There is also the perception that education is becoming too businesslike. Some educators find the idea of processing data too impersonal or corporate.

In IT, my job is not just to install software, but to calm and educate our users. I truly believe these tools can transform our environment and provide an automatic safety net for students in trouble. My job is to convince people that while their concerns are valid, we are ethical custodians of the data. I need to show them that the benefits to faculty, to College planning, and especially to our students are worth the risk.

Thursday, October 30, 2014

Our Biggest Existential Threat is...Google?

A while back I was walking through the parking lot at the Derby Street shops in Hingham and I saw my first real-life Tesla. My timing was impeccable: as I gawked at the car, a very pleasant older woman approached - it was her car - and we started a conversation. She gave me a complete tour of the car, what was great about it, why she bought it, the works. Toward the end of our conversation I mentioned that I thought Elon Musk was the successor to the recently departed Steve Jobs - a brilliant mind that married technology, design, and appeal, and might just change the world. I'll never forget what she said: "Oh, I think Musk is going to make Jobs look like a piker." (a piker is a gambler who only places small bets or someone who does things in a small way) While I didn't necessarily agree, I loved the way she phrased it - and I've been watching Elon Musk more carefully since.

So - what does this have to do with big data? I'm getting there. Clearly, Musk is not a Luddite. He's involved in solar energy, electric cars, and reusable rockets, and plans to bring humans to Mars, to name a few. Yet when Vanity Fair interviewed Musk, he warned, “I don’t think anyone realizes how quickly artificial intelligence is advancing, particularly if [the machine is] involved in recursive self-improvement … and its utility function is something that’s detrimental to humanity, then it will have a very bad effect.”

He continued with an even more dire warning: “If its [function] is just something like getting rid of email spam and it determines the best way of getting rid of spam is getting rid of humans …”

I know it's easy to dismiss this as the ravings of a lunatic - but he's not, not really. He's a visionary - a man who embraces the future, who is leading us there. But he's also a realist - and he realizes that poorly written sentient code could be our undoing. In a follow-up article in Computerworld, Andrew Moore, dean of computer science at Carnegie Mellon, offered, "At first I was surprised and then I thought, 'this is not completely crazy.' I actually do think this is a valid concern and it's really an interesting one. It's a remote, far future danger but sometime we're going to have to think about it. If we're at all close to building these super-intelligent, powerful machines, we should absolutely stop and figure out what we're doing."

So now you think either a) I'm a loon or b) what does this have to do with Google? Let us, for a moment, assume I'm not a loon. When you consider what it would take for sentient machines to cause problems, it doesn't seem that intimidating. You could just pull the plug, right? Withhold energy, and no matter how malicious they were, they would just cease. Some little factory in Arizona becomes self-aware - we just cut the power, cut the internet, and wait for the problem to resolve itself. Here's where it gets interesting. What if it were one of the world's largest global resources that began to learn?

[duh duh daaaahhh]

Here's where Google comes in (finally). Last year Google acquired DeepMind Technologies, a London-based artificial intelligence firm. A recent BetaBeat article suggests that Google is interested in using this investment to have computers begin to program themselves. Which brings us back to the title. Say it's not a factory in Arizona that becomes self-aware. Instead, consider it's the Google network, or all of IBM's networks and servers taken over by a malicious Watson of the future. If these massive big-data projects begin to think and learn and adapt - are we letting the genie out of the bottle?

Wednesday, October 15, 2014

Don’t Let Perfect Be The Enemy of the Good!

The definition of irony includes 'an outcome of events contrary to what was, or might have been, expected'. Take, for example, a group of people who use bicycles to offset the modern convenience of cars because they care about the environment and/or their health. Ironically, they rely on technology (in the form of the Strava app) to navigate a world that's not always welcoming. The result? The city of Portland (Oregon, not the much cooler Maine/New Hampshire one) "licensed a Strava metro data set of 17,700 riders and 400,000 bike trips around Portland. That adds up to 5 million BMTs (bicycle miles traveled) logged in 2013 alone" (source: TheVerge). The city will use that data to make bicycling safer, managing infrastructure changes based on riding patterns.

There are some privacy concerns, but the company says the data has been "anonymized" and users have the ability to opt out. The city and the company admit the data isn't perfect, but my favorite line from the article (and an aphorism I quote often) is: don't let perfect be the enemy of the good. Simply put, the city has an opportunity to begin shaping policy based on quantifiable data. Is the data perfect? No. Is the data skewed (toward smartphone owners)? A bit. Can design decisions consider the data on 5 million miles of bicycle travel in a single year? Absolutely!
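
The analysis itself can start simple. Here's a minimal sketch of turning anonymized trips into per-street ride counts - the segment IDs and trip format are invented for illustration, not Strava Metro's actual data model:

    from collections import Counter

    # Each anonymized trip is just the ordered street segments it traversed
    # (hypothetical IDs, purely illustrative).
    trips = [
        ["broadway_1", "broadway_2", "oak_3"],
        ["broadway_1", "broadway_2", "pine_7"],
        ["oak_3", "pine_7"],
    ]

    segment_counts = Counter(seg for trip in trips for seg in trip)

    # The busiest corridors become the first candidates for bike lanes,
    # signal timing changes, or repaving.
    for segment, rides in segment_counts.most_common(3):
        print(f"{segment}: {rides} rides")

Multiply that by 400,000 trips and the city gets a ranked list of where cyclists actually ride, not where planners assume they do.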

Can big data be scary, creepy, Orwellian? Sure. But big data can also improve the pulse of a city, making it a little safer for cyclists to commute downtown.

Tuesday, October 7, 2014

Don’t Fall in Love Just BECAUSE It’s Big Data

In the YouTube video ‘The Data Scientist’s Toolbox’, one of the salient statements for me was “Be careful not to fall in love with it just BECAUSE it’s Big Data“. This week I came across Kalev Leetaru’s ForeignPolicy.com article Why Big Data Missed the Early Warning Signs of Ebola. Mr. Leetaru begins with media reports citing Harvard’s HealthMap program picking up early reports of a mystery hemorrhagic fever 9 days before the World Health Organization. He goes on to state that HealthMap was simply picking up on tweets and retweets of a newswire article (in French) reporting on a press conference held by the Guinea Department of Health. Further, he criticizes the program for not being able to read anything but English. He’s missing the point.

To begin with, there have been at least 10 outbreaks of Ebola in the past 10 years, so early detection of this disease was more of a test case for the software. I don’t think anybody was surprised by an Ebola outbreak in March when the Harvard program detected chatter. Certainly this outbreak has become a major news story with global implications, but in March it was just another Ebola story.

What I think Mr. Leetaru is really missing is that, despite catching only reporting of official statements, the program worked! The news was detected; the system mined data from multiple sources and spotted something unusual. Could it be improved upon? Certainly. Is translation a missing component? No doubt. Still, think of the success of the program. Health officials at the CDC could be notified days before the WHO report was able to work its way through bureaucracies. Virologists who specialize in hemorrhagic fevers could be notified and placed on alert, or begin communicating with colleagues in African countries.
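
At its core, the signal is a spike in chatter against a quiet baseline. A minimal sketch, with invented mention counts and an arbitrary threshold - not HealthMap's actual pipeline:

    # Daily count of news items mentioning "hemorrhagic fever" in a region
    # (hypothetical numbers for illustration).
    daily_mentions = [0, 1, 0, 0, 2, 0, 1, 0, 9, 14]

    history, today = daily_mentions[:-1], daily_mentions[-1]
    baseline = sum(history) / len(history)

    # Alert when today's chatter far exceeds the recent baseline.
    if today > max(3, 3 * baseline):
        print(f"ALERT: {today} mentions today vs. baseline {baseline:.1f}/day")

A rule this crude still captures the point: detect the anomaly first, then layer on translation, geolocation, and source weighting.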

The title Why Big Data Missed the Early Warning Signs of Ebola is misleading. It didn’t miss the signs, it picked them up - it just happened that it picked up on official, remote, regional, page 52, below-the-fold news items - and in this, it succeeded. The Harvard team is doing ground-breaking work. If it’s not perfect yet, we cannot fault them - imagine what their software will be doing in 2 years. In 5? In 10? They are creating a system that will one day (in the not-too-distant future) link what seem like unrelated news stories into the beginnings of new epidemics. They will provide researchers invaluable data on when and where outbreaks began, helping to more quickly locate patient zeros and determine the source of infections.

Big Data didn’t miss the signs, it just wasn’t the first one to see them.  One of these days – it will be.

Thursday, September 25, 2014

Visualizing Data

Hans Rosling is one of those people, those magnificent geniuses who see the world differently - and want to force the rest of us to see things from their perspective. He is a 'statistics guru' who uses data to truly understand the world. Understanding data is one thing; getting average people to understand what you've spent an entire career trying to figure out - that's no easy feat. Dr. Rosling is a showman, taking his audience on a visual roller coaster ride through data. His presentations are fun, and the visuals are both convincing and easy to understand. He draws his pictures so clearly that it's difficult to argue with his conclusions.

Hans Rosling: The best stats you've ever seen | TED Talks

We can use Dr. Rosling's techniques to more clearly and cleanly explain data. For instance, when reporting admission trends or current headcount, many business analysts would present a complex spreadsheet to decision makers. That data is great, and when they have time, that might be just the kind of detail they're looking for. Often, though, senior managers are busy and simply want a digestible bite of data. In these cases a dashboard provides quick access to decision-making data in a form key stakeholders can use daily to understand the current state of the institution. They say "a picture is worth a thousand words" - visual data provides that same kind of compression and gives us the tools to understand data without having to understand all the underlying business rules.
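
A few lines of code can replace that spreadsheet with something glanceable. A minimal sketch using matplotlib, with invented headcount figures purely for illustration:

    import matplotlib.pyplot as plt

    # Hypothetical fall headcount by year (made-up numbers).
    years = [2010, 2011, 2012, 2013, 2014]
    headcount = [2380, 2405, 2450, 2442, 2481]

    plt.plot(years, headcount, marker="o")
    plt.xticks(years)
    plt.title("Fall Headcount by Year")
    plt.xlabel("Year")
    plt.ylabel("Enrolled Students")
    plt.tight_layout()
    plt.show()

A busy vice president can read that trend in two seconds, no business rules required - and the spreadsheet is still there for the deep dive.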



Thursday, September 18, 2014

You've Got the Data - Now What?

Collecting and aggregating data is a fairly straightforward process. Yes, you need the servers, the databases, the storage - but these days the storage infrastructure is less expensive and easier to build than ever before. Cloud storage, and even cloud clustered computing solutions, make it reasonable for anyone to store massive amounts of data and marshal the horsepower to crunch it. But then what? What do you do with all that data you've collected?

One approach is to turn to the crowd.  Instead of having 10 sets of eyes looking at it, figure out a way to segment the data, define rules for what you're trying to see, and distribute the task of looking to possibly billions of Web users.  The Chronicle of Higher Education profiled one such attempt - Alexander S. Szalay challenged the status quo on sharing astronomy data.  He began by stitching together observations from multiple telescopes to provide a clearer picture of the sky, but he realized he needed more eyes looking at all the data he was collecting.


Mr. Szalay's experience led to the creation of Galaxy Zoo, a digital catalog of images from the Sloan Digital Sky Survey. A brief tutorial teaches visitors what to look for, and multiple independent observers must agree on the classification of an image for it to be included. When the site opened in 2007, so many people used it that the servers actually overheated and caught fire; more than 270,000 people have signed up to classify galaxies so far. One even found a highly unusual object, so significant that the Hubble telescope was tasked with observing it.
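
The agreement rule is the clever part - no single volunteer is trusted, the crowd is. A minimal sketch of that kind of consensus check, with invented votes and thresholds rather than Galaxy Zoo's actual rules:

    from collections import Counter

    def classify(votes, min_votes=10, min_agreement=0.8):
        """Accept a label only when enough independent volunteers agree."""
        if len(votes) < min_votes:
            return None  # keep collecting opinions
        label, n = Counter(votes).most_common(1)[0]
        if n / len(votes) >= min_agreement:
            return label
        return "needs expert review"  # disagreement -> triage to a scientist

    votes = ["spiral"] * 9 + ["elliptical"]  # 10 volunteers, 90% agreement
    print(classify(votes))  # -> spiral

The images volunteers can't agree on are often the interesting ones - that's exactly how the oddballs get escalated to professional astronomers.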


Collecting and aggregating data is a fairly straightforward process. Getting a thousand, ten thousand, or a hundred thousand sets of eyes searching that data is brilliant. Give users a brief primer and let them perform triage on the data. The volunteers don't need to know everything - they just need enough to find things that are out of place. Those objects can be passed to the trained scientists to name, study, and dissect. Opportunities abound - anywhere you can build a simple, straightforward search rule, you can tap into the dynamic horsepower of countless human minds to look for patterns, saving the professional researchers' time for the candidate objects. Win!


Source: Chronicle article - http://chronicle.com/article/The-Rise-of-Crowd-Science/65707/