Laura started her talk by showing some simple visualizations and talking about the difficulties of reading graphs. She showed Artemis searching for the words “circumstantial” and “information” over time, and then compared it to the Google Ngram Viewer. She talked about problems with the Ngram Viewer, like the shift in characters around 1750 from the long s (which OCR typically reads as “f”) to the modern “s”. Dirty OCR makes a difference too. She then showed a problem with Artemis having to do with a dataset dropping out: Artemis aggregates a set of datasets, but not all of them cover all periods, so when one drops out of coverage you get an artificial drop in results.
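To make the dropout problem concrete, here is a minimal sketch with made-up numbers and a hypothetical two-dataset setup (not Artemis’s actual data) showing how an aggregate series falls off when one dataset stops covering later years:

```python
# Toy illustration: yearly hits for a term in two hypothetical datasets.
# Dataset B has no coverage after 1850, so the combined series shows an
# artificial drop that has nothing to do with actual usage of the term.
hits_a = {1800: 40, 1825: 45, 1850: 50, 1875: 55, 1900: 60}
hits_b = {1800: 30, 1825: 35, 1850: 40}  # coverage ends in 1850

for year in sorted(hits_a):
    combined = hits_a[year] + hits_b.get(year, 0)
    print(year, combined)
# 1800 70, 1825 80, 1850 90, 1875 55, 1900 60  <- spurious "decline" after 1850
```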
Even when you deal with relative frequency you can get what look like wild variations. These often do not indicate anything happening at the time; they indicate a small sample size. The diachronic datasets often have far fewer books per year in the early centuries than later, so search results can swing widely. A single book containing the search pattern can appear as a dramatic bump in the early years.
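A toy example of the small-sample effect (invented numbers): when an early year has only a handful of books, one matching book dominates the relative frequency.

```python
# Toy illustration: relative frequency = books matching / books in corpus.
# Early years have tiny corpora, so one matching book looks like a spike.
corpus_size = {1620: 5, 1720: 50, 1820: 5000}
matches     = {1620: 1, 1720: 2, 1820: 40}

for year in sorted(corpus_size):
    rel = matches[year] / corpus_size[year]
    print(f"{year}: {rel:.1%}")
# 1620: 20.0%  <- a single book produces a dramatic "bump"
# 1720: 4.0%
# 1820: 0.8%
```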
There are also problems with the claims made about data. There is a “real world” from which we capture (capta) information. That information is not given but captured. It is then manipulated to produce more and more surrogates. The surrogates are then used to produce visualizations, where you pick what you want users to see and how. All of these are acts of interpretation.
What we have are problems with tools and problems with data. We can see this in how women are represented in datamining, which is what this talk is about. She organized her talk around the steps that get us from the world to a visualization. Her central example was Matt Jockers’s work on gender in Macroanalysis, which seemed to suggest that we can use text mining to differentiate between women’s and men’s writing.
World 2 Capta
She started with the problem of what data we have of women’s writing. The data is not given by the “real” world; it is gathered, and the people gathering it often have biased accounting systems. Decisions made about what counts as literature, or as high literature, affect the mining downstream.
We need to be able to ask “How is data structured and does it have problems?”
Women are absent in the archive – they are being erased. Laura thinks these erasures sustain the illusion that women were not writing.
Capta 2 Data or Data Munging
She then talked about the munging of data – how it is cleaned up and enriched. She talked about how Matt Jockers has presented the differences data munging can make.
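As one concrete (and hypothetical) example of the kind of munging decision at stake, here is a crude sketch of normalizing long-s OCR errors. A naive rule like this both fixes and introduces errors, which is exactly why munging choices matter downstream:

```python
import re

def normalize_long_s(text: str) -> str:
    """Crude heuristic: OCR often renders the long s as 'f'.
    Replace 'f' with 's' when it is followed by a letter, since the
    long s did not appear at the end of a word. Note that this will
    also miscorrect real words, e.g. 'often' -> 'osten'."""
    return re.sub(r"f(?=[a-z])", "s", text)

print(normalize_long_s("the moft bleffed circumftance"))
# -> "the most blessed circumstance"  (but it would also mangle "after")
```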
The Algorithms
Then she talked about the algorithms, many of which have problems. Moritz Hardt, who has organized workshops on fairness in machine learning and wrote the essay “How Big Data is Unfair,” has shown how algorithms can be biased.
Sara Hajian is another person who has talked about algorithmic unfairness. She has pointed to how ad-targeting algorithms show prestigious job ads to men more often than to women. Preferential culture is unfair. Tricia Wang’s “Why Big Data Needs Thick Data” argues that we need both.
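A minimal sketch of how such unfairness can be measured (synthetic numbers, not Hajian’s data): compare the rate at which a hypothetical ad-targeting system shows a high-paying job ad to each group, a check often called demographic parity.

```python
# Synthetic audit log: (group, was_shown_prestigious_ad)
log = [("men", True)] * 180 + [("men", False)] * 820 \
    + [("women", True)] * 90 + [("women", False)] * 910

def exposure_rate(log, group):
    shown = sum(1 for g, s in log if g == group and s)
    total = sum(1 for g, _ in log if g == group)
    return shown / total

men, women = exposure_rate(log, "men"), exposure_rate(log, "women")
print(f"men: {men:.1%}, women: {women:.1%}, ratio: {women/men:.2f}")
# men: 18.0%, women: 9.0%, ratio: 0.50 -> far from demographic parity
```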
Laura insisted that the solution is not to give up on big data, but to keep working on it to make it fair.
Data Manipulation to Visualization
Laura then shifted to problems with how data is manipulated and visualized to make arguments. She mentioned Jan Rybicki’s article “Vive la différence,” which shows that ideas about writing like a man or writing like a woman don’t hold up. Even Matt Jockers concludes that gender doesn’t explain much: coherence, author, genre, and decade do a much better job. That said, Matt did conclude that gender was a strong signal when classifying authors.
Visualizations then pick up on simplifications.
Lucy Suchman looks at systems thinking. Systems are a problem, but they are important as networks of relations. The articulation of relations in a system is performative, not a given. Gender characteristics can be exaggerated – that exaggeration can be the production of gender. There are various reasons why people choose to perform gender, and their sex may not matter.
There is also an act of gendering in analyzing the data. “What I do is tame ambiguity.”
Calculative exactitude is not the same as precision. Computers don’t make binary oppositions; people do. (See Ted Underwood, “The Real Problem with Distant Reading.”) Machine learning algorithms are good at teasing out loose family resemblances, not clear-cut differences, and one of the problems with gender is that it isn’t binary. Feminists distinguished between sex and gender. We now have transgender, cisgender … and exaggerated gender.
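One way to see the “family resemblances” point in code (a sketch on synthetic word-frequency features, not Jockers’s actual pipeline): a logistic classifier returns graded probabilities, and it is the analyst who thresholds them into a binary.

```python
# Sketch: classifiers output graded scores; the binary cut is a human choice.
# Synthetic "relative frequency" features for a few texts (made-up numbers).
from sklearn.linear_model import LogisticRegression

X_train = [[0.2, 0.9], [0.3, 0.8], [0.8, 0.2], [0.9, 0.3]]  # two word features
y_train = [0, 0, 1, 1]                                       # labels we imposed

clf = LogisticRegression().fit(X_train, y_train)
for text in [[0.25, 0.85], [0.55, 0.5], [0.85, 0.25]]:
    p = clf.predict_proba([text])[0][1]
    print(f"{text} -> P(class 1) = {p:.2f}")
# The middle text sits near 0.5: a loose resemblance, not a clear-cut difference.
```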
Now that we can place writing on scales, we can look for a lot more than a binary.
Is complexity just one more politically correct thing we want to do? Mandell is working with Andrew Piper to see if they can use the texts themselves to generate genders.
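One way to read “letting the texts generate the categories” (my sketch, with synthetic stylometric features and k-means; the actual Mandell–Piper method is not specified in the talk notes) is to cluster texts without gender labels and then ask what the induced groups correspond to:

```python
# Sketch: induce groupings from the texts themselves instead of imposing
# a binary label up front. The features and k are assumptions for illustration.
from sklearn.cluster import KMeans

features = [[0.1, 0.7], [0.2, 0.8], [0.15, 0.75],   # synthetic stylometric
            [0.7, 0.2], [0.8, 0.1], [0.75, 0.15],   # feature vectors
            [0.45, 0.45]]                            # an in-between text

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels)  # three induced groups, not a presupposed binary
```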
It is also true that sometimes we don’t want complexity. Sometimes we want simple forceful graphics.
Special Problems Posed by Visualizing Literary Objects
Laura’s last move was to look at gender in literary texts and discuss the problem of mining gender in texts with characters. To that end she invoked Blakey Vermeule’s Why Do We Care about Literary Characters?, discussing Miss Bates and marriage in Austen’s Emma.
Authors make things stand out in various ways using repetition, which may throw off bag-of-words algorithms. Novels try to portray the stereotypical and then violate it – “the economy of character.”
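A quick sketch of the repetition problem (an invented snippet, not a quotation): a bag-of-words count lets one repeated epithet dominate, whereas a binary presence/absence representation flattens the rhetorical effect.

```python
from collections import Counter

# A character tic repeated for effect dominates raw bag-of-words counts.
text = "so obliging so very obliging always so obliging to everyone"
counts = Counter(text.split())
print(counts.most_common(3))   # [('so', 3), ('obliging', 3), ('very', 1)]
print({w: 1 for w in counts})  # binary presence/absence flattens repetition
```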
Novels perform both bias and the analysis of bias – they can create and unmask biases. How is text mining going to track that?
In “A Matter of Scale,” Jockers talks about checking confirmation bias, to which Julia Flanders replies that we all operate within community consensus.
The lone objective researcher is an old model – how can we analyze within a community that develops consensus using text mining? To do this, Laura Mandell believes we need capta open to examination, dissensus driving change, open examination of the algorithms, and attention to how visualizations represent the capta.