Review of Flynn’s What is intelligence? by Paul F. Ross
Flynn, James R. What is intelligence? Beyond the Flynn effect. 2007, Cambridge University Press, Cambridge UK, xi + 217 pages
For those who need an evaluation of a book in the first few sentences of a review, know that Flynn’s report is important, impossible to read and understand for anyone but the most dedicated professional psychologist specializing in the measurement of intelligence, and misleading. It is not the first, nor is it likely to be the last, of the books on this topic that are misleading and seriously lacking foundation for some of their work. At just over 200 pages in a small-page format, it’s short. It works on an important task, a terribly important task: measuring human intelligence and understanding its implications. There are moments when the book is thoughtful and insightful, but the non-psychologist will find those moments difficult to distinguish from the majority of its meanderings, when the author thinks loosely, sometimes erroneously, without the necessary anchors in broadly conceived empirical observations about intelligence and in the operations for measuring it. If an introduction to the century-long struggle to understand human intelligence and its consequences led you to Flynn’s title and this review, you may find the background provided in this review of more help than Flynn’s book itself.
The public’s discovery of racial differences in intelligence
Herrnstein (Harvard University) and Murray (American Enterprise Institute) published The bell curve (1994), which treated the consequences of general intelligence for a large, representative sample of US youths, ages 14 to 22, who had been selected in 1979 and followed repeatedly over the next fifteen years (by which time Herrnstein and Murray accessed their data), with information collected about the events of their lives … some having reached their thirties by the time The bell curve was written and published. The bell curve reported correctly that many unwanted life events happened to those whose intelligence had been measured as low when they entered the study in 1979. Those of low (measured) intelligence generally failed to enter and complete college, experienced high rates of unemployment, became single parents, needed a variety of public services, and committed crimes and were jailed, all these unwanted outcomes occurring at higher rates than for those whose intelligence was higher. I read The bell curve in 1994, knew I was reading something important, and invited fellow townsfolk in the privileged town of Lincoln, Massachusetts to gather in my living room for several weekly discussions of the book. The book immediately leaped to nationwide notoriety through newspaper reviews and commentary because it pointed clearly to the increased likelihood of these unhappy life circumstances for American blacks. Gould of Harvard reviewed the book; his reviews appeared in newspapers and magazines and all but called The bell curve racist. Gould had begun his campaign against the measurement of intelligence years earlier, in his The mismeasure of man (1981). Herrnstein and Murray, with Gould’s help, made Americans very conscious of, and sensitive to, racial differences in average intelligence test scores and in average school achievement.
Buried in Herrnstein’s and Murray’s The bell curve was a report about James R. Flynn’s work showing that IQ (intelligence quotient) scores for succeeding generations of children are going up. These were not small gains in scores. Projected backward to the grandparents’ generation from an assumed current adult average IQ of 100, the “Flynn effect” implies that almost everyone’s grandparents were borderline mentally retarded! Were subsequent generations getting more intelligent? There was no answer. Herrnstein and Murray labeled Flynn’s observation “the Flynn effect.” Having read Herrnstein’s and Murray’s report with care in 1994, I must have seen their comment on “the Flynn effect,” but I made no note of it then. It made no impression on me. I was alerted to the Flynn effect during recent reading, purchased Flynn’s book, and am reviewing it, as has been my habit for a decade with each book that I read.
The meaning of racial differences in intelligence
We can and do classify individuals using racial and ethnic terms such as white, black, Asian, Hispanic, native American, German, Irish, Swedish, Italian, and the like. The classifications become less and less meaningful as “interracial marriages” increase sharply and more children of “mixed” racial and ethnic backgrounds are born of these marriages. Studying “race” through the lens of DNA examination, we find that we’re all of mixed race. When we assemble intelligence test scores “by race,” we observe differences between the averages. Those differences are real. They are “statistically significant.” However, when one examines and understands the wide variation in intelligence within a “race” or an “ethnic group,” looking at one individual and then the next, for Race A and Race B, ethnic Group F and Group G, the differences between the averages fade into being inconsequential. They are overwhelmed by the variety of intelligence within each group. Pursuing “group differences,” there also are differences between the averages of men and women, but they too fade into being inconsequential as one understands and appreciates the wide variation among individual women and, likewise, among individual men. The excitement about interracial differences in mean intelligence test scores is driven not by the practical importance of those differences (for there are no large practical differences, as many aspects of human performance teach us if we will but look) but by emotions and habits of thought with respect to race … or gender … or sexual preference … or Italians and Greeks and Chinese … or any other means for classifying individuals that our cultures teach us to consider important. What, then, do we do with these emotions?
One thoughtful response is “We try to understand what ‘intelligence’ means … how it is measured … whether it is measured so that any individual, no matter his/her background, has an equal opportunity to show his/her stuff (city kids are not asked questions about cows; country kids are not asked questions about street numbers) … whether those measuring instruments remain ‘well calibrated’ as time and society move from one decade to the next.” Presumably a reader picking up Flynn’s What is intelligence? intends to move toward answering these questions. However, Flynn does not address the issues at the level of these very important background questions!
The reader needs to know, among other things, the history – about a century old – of the measurement of intelligence.
One of the earliest measures of intelligence was developed by Binet in France in the early 1900s. Building upon Binet’s work, Terman and others produced the Stanford-Binet test in the US. The Army Alpha, an intelligence test that could be given to tens or hundreds of individuals simultaneously, was built for use with military recruits in WWI. Since then the Wechsler tests, and many, many others including the Scholastic Aptitude Test (SAT), have been built. Some intelligence tests are designed to be given to just one person at a time, the test administration requiring the full focus of a test administrator. These include the Stanford-Binet and the Wechsler tests. Other tests are designed so they can be given to tens or even hundreds of individuals simultaneously as well as to just one individual at a time. The Army Alpha and the SAT are of that kind. Flynn’s research, leading to his description of what came to be called “the Flynn effect,” used several versions of the Wechsler tests as its main source of data. Binet’s work began about 1900 so that school students could be separated, based on their scores on Binet’s test, into different classrooms where daily tasks suited to slow, average, and fast learners could be used. That helped teachers capture and keep the attention and interest of all the students in a classroom rather than wrestle, within a single classroom, with the problem of maintaining the interest of the fast learners while the slow learners were practicing the basic skills.
The National Longitudinal Survey of Youth was launched in the U.S. in 1979; it gathered information about youths then ages 14 to 22 and followed them regularly in subsequent years. The Armed Services Vocational Aptitude Battery (ASVAB) was administered to these youths. The survey work and database maintenance were done under the leadership of researchers at The Ohio State University. Herrnstein and Murray built their study of the life-outcome implications of intelligence, as reported in The bell curve, on the Armed Forces Qualification Test (AFQT), the military’s intelligence measure computed from ASVAB subtests.
There had been many other prior studies of intelligence and student achievement. James Coleman published his study of the Equality of educational opportunity in the U.S. (1966) nearly three decades before Herrnstein’s and Murray’s The bell curve (1994). It was my privilege to participate in a seminar reviewing the newly published Coleman study in the fall of 1966 at Harvard University. The seminar was convened by the Dean of the School of Education at Harvard and led by Daniel Patrick Moynihan (later U.S. Senator from New York) and Thomas Pettigrew. It had about forty attendees, including the Chayes from Harvard’s Law School; Arthur Jensen, who has since published frequently on intelligence; Christopher Jencks, who later considered Who gets ahead? (1979) in America; and the well known statistician Alexander Mood, who flew to Boston from Washington DC each time the seminar met in order to attend our meeting. We met at the Harvard University Faculty Club, were served dinner, enjoyed conversations with our fellow seminarians, then settled down to serious discussion of Coleman’s report. With Moynihan’s height and loud voice dominating pre-dinner chatter, fueled by his love for Irish whiskey (recall his 1970 memorandum to President Nixon, a few years later, about the need to treat race issues with “benign neglect”), one sometimes wondered whether the seminar could get down to serious and coherent thought. Racial differences in school achievement by students throughout the United States were prominent findings of the Coleman study. High performance on tests of knowledge by Asians, Unitarians, Jews, and others, along with low performance on the same tests by American blacks, native Americans, Hispanics, and kids from poor urban and rural neighborhoods, were prominently reported.
Discovering why those differences in educational achievement were there, as prominently and convincingly revealed by Coleman’s study, became and has remained a major challenge to education in America.
I was the only one of the forty or so attendees at that Moynihan-Pettigrew seminar in 1966 who submitted a memorandum to the seminar. I recommended a factor analysis of Coleman’s data about schools. Mayfield (I think it was Mayfield) of the National Center for Educational Statistics did such a study after our seminar had adjourned. I was convinced that the general intelligence of each school’s student population affected the students’ performance on achievement tests, and I sought more information from Coleman’s data on whether additional resources (more teachers, better teachers, more money for schools, etc.) affected students’ achievement as independent inputs to the educational process after controlling statistically for the intelligence of the students. The study was done. Mayfield overlooked the opportunity to send me a complimentary copy of the report, and I’ve never read it!
In the four decades since the Coleman report, with several boosts toward equalizing opportunity in education from the U.S. Supreme Court, U.S. educators and other professionals have addressed the matter (a) by carefully researching instruments for measuring intelligence to make sure they avoid biases in the selection of material about which to be quizzed in an intelligence test, (b) by providing Head Start schooling opportunities for children in economically deprived settings, (c) by busing students “across town” if necessary to give each school a set of students that racially represents the local region, (d) by carefully equalizing the educational resources available in each school under one school board (resource equalization is not always easy to accomplish when local school districts commonly have the opportunity to set their own tax rate for raising revenues for local funding of public schools), and (e) by hiring policies for teachers and school staff that emphasized racial diversity.
In short, the consequences of intelligence differences that Herrnstein and Murray reported in 1994 were not new findings. Coleman had reported the racial discrepancies in educational achievement nearly three decades earlier. Herrnstein and Murray, using the National Longitudinal Survey of Youth begun in 1979, were able to show a variety of life consequences that follow for individuals finding themselves at different places on the bell curve with respect to intelligence, the intelligence measure having been obtained at about eighteen years of age or earlier. Jencks (1979) had added to the reports of life outcomes (as influenced by background and intelligence) in Who gets ahead?, published thirteen years after the Coleman study and fifteen years before Herrnstein’s and Murray’s The bell curve (1994).
How is intelligence measured?
Intelligence is measured in essentially the same way any “knowledge” known to any individual is measured. The examiner (or test) asks a question. The person being tested answers the question. The answer is determined to be either “right” or “wrong” by following rules for “scoring” the answer established at the time the question was framed by the test builder. A full examination is accomplished by asking many questions, usually beginning with questions most people will answer correctly (easy questions) and then proceeding to the questions for which fewer people know the answers (the more difficult questions). One’s “score” on the test (in most instances) is determined by counting the number of right answers. Some intelligence tests use more complicated scoring procedures. The test taker’s IQ (intelligence quotient) is then determined by comparing this test taker’s number of right answers with a database showing the numbers of right answers achieved by individuals of the same chronological age as the test taker when they took the same test. If the test taker gets a score that is average as compared with the “norm group,” s/he is assigned an IQ of 100. If the test taker gets a score one standard deviation lower than the average for the norm group among those of the same age as the test taker, the test taker is assigned an IQ of 85. If the test taker gets a score two standard deviations above the average for the norm group who are of the same age as the test taker, s/he is assigned an IQ of 130. In effect, IQ is defined by the number of questions answered correctly by the test taker after the test taker’s score has been compared with the scores of many individuals of the same age who participated in the norm group.
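The comparison with a same-age norm group can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine, not Flynn’s or any test publisher’s: the norm groups are tiny, invented lists of raw scores, the function name deviation_iq is hypothetical, and the IQ scale is assumed to have a mean of 100 and a standard deviation of 15, the scale implied by the examples above (one standard deviation below average gives 85, two above gives 130). Real tests use published norm tables built from large samples in narrow age bands.

```python
from statistics import mean, pstdev

# Hypothetical norm data: raw scores (number right) for toy same-age groups.
NORMS_BY_AGE = {
    10: [30, 50, 30, 50],  # toy norm group for 10-year-olds (mean 40, SD 10)
    12: [40, 60, 40, 60],  # toy norm group for 12-year-olds (mean 50, SD 10)
}

def deviation_iq(raw_score, age, norms=NORMS_BY_AGE, iq_mean=100.0, iq_sd=15.0):
    """Compare a raw score with the norm group of the same age and rescale
    the difference to an IQ scale with an assumed mean of 100 and SD of 15."""
    group = norms[age]
    z = (raw_score - mean(group)) / pstdev(group)  # SDs above/below the same-age average
    return iq_mean + iq_sd * z

print(deviation_iq(40, age=10))  # average for age 10 -> 100.0
print(deviation_iq(40, age=12))  # one SD below average for age 12 -> 85.0
print(deviation_iq(60, age=10))  # two SDs above average for age 10 -> 130.0
```

The same raw score of 40 yields an IQ of 100 for a 10-year-old but 85 for a 12-year-old, which is why the comparison must be made with test takers of the same age.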
Thus an intelligence test is a test of knowledge, knowledge of a kind that presumably “everybody” has had an opportunity to learn by the age of the test taker; the test taker’s score is then compared with the scores of a norm group.
What is a norm group? For tests in the U.S., the “norm group” usually is a group with equal numbers of males and females drawn from the U.S. population. Builders of intelligence tests take great care to find and test large numbers of individuals who, in combination, represent “the population of the United States” and so form a U.S. norm group for the test. The norm group needs to be “like” the U.S. population as a whole with respect to gender, age, race, ethnic background, religion, political persuasion, preferences for favorite sports, health, settings in which they have lived … and so on and on although, as you can guess, few norm groups are checked for their “representativeness” with respect to all these variables.
It is immediately obvious when constructing a test that six-year-olds have a different vocabulary than do eighteen-year-olds, so questions to be used in the intelligence test have to be fashioned to sample knowledge that has in some sense been accessible to the test taker in the test taker’s lifetime and life circumstances. For the test results to be accurate, and for test takers to have a fair opportunity “to show her/his stuff,” the questions must be chosen to test knowledge that the test taker has had the opportunity to acquire.
The foregoing, then, is the core framework describing how intelligence testing is done and how IQ scores are determined. It does get a bit more complicated, of course. Should the questions be about the meaning of words? Yes. Should they require the manipulation of numbers? Yes. Should they present geometric figures, rotate those figures, and expect the test taker to understand and report what rotation has been done? Yes. Should the test taker’s score depend on how many squares s/he can put three pencil dots into in fifteen seconds? Well, perhaps. Should questions be asked for which the right answer depends on an understanding of social niceties? Well, perhaps. Should the test taker be asked whether s/he ever crosses the street to avoid meeting someone? Well, perhaps. Which, then, is the “right” answer: “I do” or “I don’t”? Is it fair to ask the test taker whether s/he has ever drawn pictures or words on a public wall facing the street? Which, then, is the “right” answer: “I have” or “I haven’t”? Is it fair to ask “Are all cows black?” when some test takers may never have seen a cow? As you can see, “intelligence” soon has many different facets. Some test takers are very good with words, less able with numbers. Some easily understand rotating geometric figures while others understand them less well. Some understand social niceties while others do not. Thus “intelligence” is sometimes described as having a verbal component, a numeric component, a geometric (or spatial) component, an interpersonal or emotional component, and so on.
Very early in the history of intelligence testing, Spearman (by 1904) described the fact that some test takers got high scores on most of the different kinds of questions while others tended to get low scores regardless of the question’s content … this when each individual appeared to have given full attention and effort to answering all the questions correctly. The fact that scores on many different types of questions were positively correlated was taken to reveal “general intelligence” or simply “g.” In the history of educational and psychological measurement from 1900 to the present time, no “measure” of an individual’s capabilities is as accurate (as valid) in forecasting life’s outcomes (school performance, performance on the job, difficulties occurring in a lifetime such as going to jail, etc.) as is “g.” Many people dislike this “fact of life,” resisting the notion that “intelligence” is a characteristic having a large influence on life’s outcomes, but a fact of life it remains whether we like it or not. Gould (1981) wrote a book describing how “measures of man” are being misused. Measures of intelligence or measures of aspects of personality can be, and are being, misused. But they also can be used properly to the immense advantage of individuals, families, the economy, and societies worldwide. Gould’s work (1981) has not contributed to a balanced view of intelligence, of the importance (or lack of importance) of differences between the average scores of racial (and other) groupings, of the proper use of intelligence measures (and other measures) of individual human performance, or of the benefits that follow from the proper use of “measures of man.”
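Spearman’s observation, that scores on very different kinds of questions are positively correlated, can be illustrated by extracting a single general component from a table of correlations. The Python sketch below rests on loudly stated assumptions: the four subtests and their correlations are invented for illustration, and the first principal component (found here by power iteration) is a simple stand-in for a general factor. It is not Spearman’s own procedure, nor the method used by any particular modern test.

```python
# Hypothetical correlations among four subtests (vocabulary, arithmetic,
# spatial rotation, digit coding). All are positive, as Spearman observed.
R = [
    [1.00, 0.60, 0.45, 0.40],
    [0.60, 1.00, 0.50, 0.40],
    [0.45, 0.50, 1.00, 0.35],
    [0.40, 0.40, 0.35, 1.00],
]

def first_component(matrix, iterations=200):
    """Power iteration: return (largest eigenvalue, unit eigenvector)."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    rv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * rv[i] for i in range(n))  # Rayleigh quotient of unit vector v
    return eigenvalue, v

eigenvalue, vector = first_component(R)
# Loadings: how strongly each subtest goes with the general component.
loadings = [round(x * eigenvalue ** 0.5, 2) for x in vector]
share = eigenvalue / len(R)  # a correlation matrix has total variance equal to its size
print("loadings on the general component:", loadings)
print(f"share of total variance: {share:.0%}")
```

Every subtest loads positively on the one component, which is the pattern the label “g” was coined to describe.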
Where does Flynn’s book fit in this story?
Flynn (1984) began pointing to the ever-improving scores on subtests of the Binet and Wechsler tests. Youths tested in 1945, with their scores compared against norms gathered in 1945, received relatively accurate IQs. Youths tested thirty years later, in 1975, using the 1945 version of the test, with their scores again compared against the 1945 norms, earned higher scores and therefore received higher IQs. Were the youths tested in 1975 “more intelligent” than the youths tested in 1945? Flynn thinks not. In early chapters of What is intelligence? (2007), Flynn argues that cultural changes occurred between 1945 and 1975. Television, rare in 1945, had become an important part of most families’ lives by 1975. Children and youth were exposed to a larger variety of sources of information in 1975 than were available to the youths of 1945 and therefore learned many things that youths of 1945 had no opportunity to learn. They learned more skills and more background items of information. They could answer more questions correctly on the Binet and Wechsler subtests. Using the 1945 norms, those tested in 1975 received higher scores and therefore higher IQs. Flynn’s arguments explaining the “Flynn effect” are appealing. They may be correct. Since IQ scores have grown by about 0.3 IQ points per year, Flynn argues that an IQ obtained in 1970 from a 1945 version of a test, scored against the 1945 norms, should be corrected using the following calculation: “IQ-corrected-for-1970 = IQ-based-on-1945-norms – ((1970 – 1945) × 0.3).” In effect, the IQ based on 1945 questions and 1945 norms should be decreased by 0.3 of an IQ point for each year intervening between the year the norms were established and the year the test taker was tested.
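The correction Flynn recommends is simple arithmetic, sketched below in Python. The sketch assumes a constant gain of 0.3 IQ points per year, the figure given in this review; the function and parameter names are hypothetical, and a real clinical adjustment might depend on the particular test and its exact norming dates.

```python
FLYNN_RATE = 0.3  # assumed IQ points gained per year, the rate cited in this review

def flynn_corrected_iq(obsolete_norm_iq, norm_year, test_year, rate=FLYNN_RATE):
    """Reduce an IQ scored against old norms by `rate` points for each
    year between the year the norms were gathered and the year of testing."""
    if test_year < norm_year:
        raise ValueError("test_year must not precede norm_year")
    return obsolete_norm_iq - (test_year - norm_year) * rate

print(flynn_corrected_iq(100, norm_year=1945, test_year=1970))
print(flynn_corrected_iq(100, norm_year=1945, test_year=1975))
```

For a test normed in 1945 and taken in 1970 the deduction is 25 × 0.3 = 7.5 points, so an IQ of 100 against the old norms becomes 92.5; for 1975 the deduction is 9 points, giving 91.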
Read the above paragraphs again and again until you understand them. Nowhere in Flynn’s What is intelligence? will you find as clear and succinct an explanation of the “Flynn effect,” its causes, and what to do about it (all this according to Flynn) as is presented in the above paragraphs. That points to a major problem for the book. Flynn does not know what audience he is writing for and, even if he did, he would not know how to communicate with the audience.
One possible audience for this book is the audience of professional psychologists who use the Stanford-Binet and Wechsler tests to establish IQs for the youths and adults that they test. Flynn does not write for them. He shows only the most general of empirical findings describing the “Flynn effect,” explains those findings using (almost) only words when tables and graphs and descriptions of procedures and correlations of “corrected” IQs and “uncorrected” IQs with other variables are needed to convey information to his audience and convince them that Flynn has a handle on important truth and is recommending procedures that are essential. The scientific-professional audience will know that many variables affect test performance and a young student’s school involvement, and they will expect to see studies showing how these additional influences (test taker’s ethnic background, family’s socioeconomic status, test taker’s first language, quality of schools attended, specific exposures to learning experiences outside of school [computer games, museums, sports, …], and the like) are related to the norm-based IQ score and the corrected IQ score. The scientist-professional audience will want to know how scores from group tests of intelligence (the Army Alpha, the SAT, etc. etc.) correlate with the norm-based and the corrected IQ scores derived from individual testing. Does the corrected score make a practical difference? Flynn provides none of this information in this book.
Another audience for Flynn’s book is the audience of educators who need to know which students can perform well in the normal school classroom, which students need to be in classrooms for students with special needs (and just what are this test taker’s special needs?), and which students can be helped only by one-on-one education and coaching, along with their caretakers’ insight into the level of self-sufficiency such a student will be able to achieve. Flynn’s book provides none of this information.
Another audience for Flynn’s book is the parents of students, particularly young students, who are experiencing difficulties in school and need to understand what IQ means, how knowledge of the student’s IQ can be helpful, and just what kinds of educational guidance and educational experiences their child may need to grow into a self-sufficient adult able to function in adult society. Flynn provides none of this information in this book.
Flynn makes technical errors for which this reader cannot forgive him. He writes, for example (pp. 79-80), “By adding his triarchic measures to the traditional predictive variables of high-school grades and SAT (Scholastic Aptitude Test) scores, [Sternberg] increased the percentage of variance explained from .159 to .248. Which is to say that the correlation between the predictive measures and university grades increased from .40 to .50.” The percentage of variance explained by a correlation of 0.40 is 16%, not “.159” as reported by Flynn. The percentage of variance explained by a correlation of 0.50 is 25%, not “.248” as reported by Flynn. Flynn seems not to know the difference between a percentage and a proportion and how each is expressed in numbers. Most competent writers in the behavioral and management sciences understand that the percentage of “variance explained” typically has enough error in its estimate that reporting the result to three significant digits (Flynn’s “.159” instead of the “16” I have used in this paragraph) cannot be justified. The unhappy truth is that many authors make the same error. Flynn’s statement on pages 79-80 is not the only place in the book in which Flynn makes both these errors (incorrect reporting of percentage; reporting results to a degree of accuracy greater than the underlying data justify). Furthermore, SAT scores obtained when the test taker is a high school junior commonly correlate about 0.60 with the student’s Grade Point Average (GPA) in the first year of college, observed two years later … explaining roughly 36% of the variance in GPA. If the SAT alone commonly explains about 36%, how can the SAT and high school grades together explain only 16%, and how can adding Sternberg’s measures lift the total only to 25%? Those figures sit well below what is commonly observed. Flynn clearly is not an accurate mirror for Sternberg’s results.
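The arithmetic behind these objections is easy to check: the proportion of variance explained by a correlation r is r squared, and multiplying that proportion by 100 turns it into a percentage. The short Python sketch below converts in both directions; the function names are illustrative only, not taken from Flynn or Sternberg. It shows that Flynn’s “.159” and “.248” are proportions corresponding to correlations of about .40 and .50, that is, to about 16% and 25% of the variance.

```python
import math

def percent_variance_explained(r):
    """Percentage of criterion variance accounted for by a correlation r."""
    return 100 * r ** 2

def correlation_from_proportion(r_squared):
    """Recover a (multiple) correlation from a proportion of variance explained."""
    return math.sqrt(r_squared)

for r in (0.40, 0.50, 0.60):
    print(f"r = {r:.2f} -> {percent_variance_explained(r):.0f}% of variance")

# Flynn's reported proportions, converted back to correlations
for prop in (0.159, 0.248):
    print(f"proportion {prop} -> R = {correlation_from_proportion(prop):.2f}")
```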
When an author, Flynn in this case, does not have the technical details of his science in hand, one wonders if he is a competent viewer and critic of the larger picture.
Flynn worries that the outcome of individual capital cases can often hinge on the adult measurement of the accused’s IQ, since U.S. law bars the execution of a defendant who is “mentally retarded.” Flynn properly worries that an IQ “inflated” by the “Flynn effect” (obtained by evaluating the accused’s test score against obsolete norms that assign a falsely high IQ) can send a defendant to death row when an “accurate” IQ would show that he/she should not be put to death, a genuine miscarriage of justice as justice is defined by U.S. law and practice. This reader understands and applauds that concern. Of course, if societies eliminated the death penalty altogether, as many think should happen, this concern would disappear.
Flynn spends a few pages discussing aspects of “emotional intelligence” that Goleman (1995) proposes as well as the “triarchic mind” that Sternberg (1988) proposes. As a professional psychologist, I found these pages interesting reading. However, Flynn does not introduce the reader to the means for measuring these aspects of intelligence, nor does he present data showing how these aspects of intelligence relate to IQ and to the larger complex of skills described by the factor analysts. In fact, Flynn does a good deal of bashing of factor analytic approaches to understanding variables that reflect different aspects of intelligence without explaining why he’s so critical of the factor analysts. Yet Flynn acknowledges that “g,” a product of the work of factor analysts, has proved to be scientifically and socially useful.
In short, I can name few audiences for which this work by Flynn is a surefire “good read.” I am pleased to have purchased and read the book since it introduced me to work about which I knew little. While I’m grateful for the introduction, I have not learned what I needed to learn about those topics new to me, Flynn being such an incomplete, unthoughtful reviewer-reporter.
Intelligence and its implications are such emotion-laden topics that no writer known to me in the last half century seems to be able to handle them. [This review has been shortened by Amazon.com's limits on review length.] 12/29/13 PFR