Data and me

January 19, 2008 8:56 pm

Not from TNG.

When I was a kid, maybe 7 or 8, there was a summer I discovered baseball boxscores and the leader lists. The lists were long in the Sunday paper, with RBIs, HRs, batting average. But in the daily paper we just got these little boxscores, and the leader lists didn’t get updated. Today computers ensure that when a guy hits a triple, the boxscore tells you how many that makes for the year, and automatically updates his slugging percentage and his batting average. But this was back in the day (for values of “day” close to the early 1970s).

Eduwonkette and Leo found problems in the Dept of Education’s class size data. I should jump in.

I liked to look at how long the games were, and whether anyone did exceptionally well. I could spend an hour a day. My stepfather, I think, pointed out that if the pitcher was good, the game was more likely to go fast. I started looking for the relationship, and started trying to predict which games would go less than 2 hours, based on the pitchers’ records. I wasn’t particularly good, nor was I bad. I had no idea of the limitations of the data I was given (i.e., found in the paper), and didn’t really know how to improve, except by rough guess and check.

I liked the lists of leaders, especially batting average. Home runs were easy, just see if anyone in the top ten hit one. But batting average, aha, that’s where the magic of math played a role. If a player was 23/80, and went 3/4, what happened to his average? It went from .288 to 26/84 = .310 (with rounding, these are always rounded to 3 digits). But the daily paper didn’t usually say that the guy was 23/80, just that he was .288. So how could a kid recalculate batting averages each day (while his mother yelled at him to go out and play)? He could either keep running totals (which I did not), or he could discover some secret math involving long division (which I did). I know, early in the season, that .255 means 12/47 or 13/51 or 14/55, and that nothing else is possible. Better, I know that .302 means 13/43, 16/53 or 19/63, with no reasonable alternates, but .221 has lots of possibilities: 15/68, 17/77, 19/86, 21/95, 23/104, and even more. And this work accomplished – nothing useful. But I practiced lots of long division, and gained a feel for how numbers worked together.
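For anyone who would rather not do the long division by hand, here is a rough sketch of the same trick in Python. The function names (`rounded_avg`, `candidates`) are my own inventions for illustration, and I am assuming the paper rounded half up to three digits. Averages are handled as whole thousandths (255 for .255) so there are no floating point surprises.

```python
def rounded_avg(hits, at_bats):
    """Batting average in thousandths, rounded half up, integers only.

    Assumption: the newspaper rounded half up to three digits.
    """
    return (2000 * hits + at_bats) // (2 * at_bats)


def candidates(avg, max_at_bats):
    """Every (hits, at_bats) pair up to max_at_bats that prints as avg."""
    return [
        (hits, at_bats)
        for at_bats in range(1, max_at_bats + 1)
        for hits in range(at_bats + 1)
        if rounded_avg(hits, at_bats) == avg
    ]


# The 23/80 hitter who goes 3 for 4: average before and after.
print(rounded_avg(23, 80), rounded_avg(23 + 3, 80 + 4))

# Early in the season (say, at most 60 at-bats), what can .255 be?
print(candidates(255, 60))

# .221 with up to 104 at-bats has more possibilities.
print(candidates(221, 104))
```

The at-bat cap does the work of "early in the season": the smaller it is, the fewer fractions can hide behind one printed average.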

Fast forward to my first real job, with a public agency. I did lots of things, but data analysis was my specialty. Stupid reports. Compiling numbers. Formatting, displaying them. But also discovering things. A private company supplied monthly ‘data.’ It came in two main categories, further divided into 20 smaller categories. They also reported the overall data segmented into one major type and two minor types. Month after month they supplied data, and I summarized and reported it. But I also played with it. Back to my boxscore days, I noticed that the ratios between some of the subcategories seemed ‘consistent.’ I found a way to strip away one of the minor types of data, and discovered that the major type was not real data, but a monthly grand total (probably a real number) allocated amongst the categories by formula. You know the faked x-rays in The China Syndrome? Sort of the same thing, but with no lives at stake.
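The kind of check I mean can be sketched in a few lines of Python. This is not the analysis I actually ran, just the idea: if each category's share of the monthly total barely moves from month to month, the categories were probably allocated by formula rather than counted. The function names, the tolerance, and every number below are made up for illustration.

```python
def shares_by_month(rows):
    """Turn each month's category values into fractions of that month's total."""
    return [[value / sum(row) for value in row] for row in rows]


def max_share_spread(rows):
    """Largest month-to-month swing in any single category's share."""
    shares = shares_by_month(rows)
    return max(max(col) - min(col) for col in zip(*shares))


def looks_allocated(rows, tolerance=0.005):
    """True when every category's share is nearly constant across months.

    The tolerance is an arbitrary assumption, not a tested threshold.
    """
    return max_share_spread(rows) <= tolerance


# Invented data: three months, three categories.
# Here each month is just a total split 50/30/20.
allocated = [[500, 300, 200], [620, 372, 248], [440, 264, 176]]
# Here the categories move around on their own.
organic = [[500, 300, 200], [700, 310, 230], [380, 290, 210]]

print(looks_allocated(allocated), looks_allocated(organic))
```

Real counts wobble. Shares that never wobble are the tell.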

I got better at finding lies in data. And I got good at lying with numbers myself. Every year or so a major decision would come up, which my agency and several others would weigh in with opinions on. The opinion would essentially be ‘yes’ or ‘no.’ My boss or my boss’ boss, or even one level up would ask for help from the analysts. Someone would tell me if the conclusion should be yes or no, and ask me to work up the data to support that conclusion. The same process, I assumed, happened in the other agencies that were providing comment. And I caught lots of them. I could dig through reports and find the hidden assumptions, the use of partial data, the relevant data that they were ignoring – I was pretty good. But my assumptions, hidden deep in very reasonable-looking models – I never got caught. The better I got, the more disgusted I became, which helped motivate me to leave.

Ok, so let’s wrap this up. I have had major running disagreements with Leo Casey about major issues involving public education, testing, accountability, and what the UFT should be doing. I’ve always been right, and he’s always been wrong (just ask me). He’s been wrong about small schools in New York City, and he’s been wrong about me (I am not anti-small school, I am against the lousy ones we’ve got). But two weeks ago he published something on class size data, and he did a great job with some data. Credit where it’s due: he found a hidden pattern in the City’s numbers (essentially, the City underfunds large schools).
Eduwonkette did a great job, too.

And me? I did nothing. But it means I need to dust off my boxscore and forensic-number-lying skills, and see if I can’t extend their work and dig up more.

needs further analysis
