Chuck--all valid points but before we send the Census Bureau off on further disclosure
avoidance tracks, I would like to see someone prove that you can really disclose someone.
Maybe we should have taken part of the $3 Million to produce the CTPP and offered a reward
for anyone who could have taken the 1990 data (CTPP, PUMS and STFs) and identified a
specific individual. Right now, disclosure appears to be more conjecture and theory as
opposed to fact. I would also like to see this examined from the legal side. First, we
know that there have never been any court cases of disclosure, so there is no case law to
draw upon. Second, I can't believe that any attorney would ever take on a case once
he/she learned about all the imputation, data swapping and
other things that go on that raise legitimate threats to the validity of the data--not to
mention that it is only a sample.
As far as finding yourself in the block data, the CB disclosure people will tell you that
they checked for that and swapped data elements around. Although you think it is you in
the block data it really is not. When it was suggested that the CTPP uses all the same
data elements that have been already sanitized, the answer we were given was that they
could not check all our data and somehow it was different.
If we think that rounding is a pain now, with Parts 1 and 2, just wait till you get your
rounded flow data.
Chuck Purvis wrote:
Ed and CTPP-News:
The rounding of values inside the CTPP is, right now, a modest, annoying data processing
issue. As professional data analysts, we are always on the lookout to make sure our
numbers "add up" so that we're not missing anything. Rounding should be a
privilege of the data analyst, AFTER all of the precise number-crunching has been
performed. So, I want to make sure in my data analysis that the year 2000 total population
of my region is ALWAYS 6,783,760. IF IT'S DIFFERENT, THEN I MADE A MISTAKE THAT I HAVE
TO CORRECT. After I get the precise number, then I can do the rounding off to my
heart's desire, that is, 6.8 million persons, or 7 million, whatever. It is annoying,
frustrating, an inconvenience, and a pain to NOT have the numbers add up!
The Census Bureau's use of rounding is an attempt at "disclosure avoidance,"
that is, to foil attempts of the data analyst to "reverse engineer" the precise
name, address, and characteristics of individuals and their households. I frankly do not
believe that rounding is the best method for ensuring disclosure avoidance. I believe
other mathematical techniques to "dither" or randomize the reported data would
be more useful, in terms of disclosure avoidance, and useful to the analyst, in terms of
removing all of the rounding errors inherent in the current CTPP. My recommendation to the
Census Bureau: do the right thing and hire mathematicians to find the best methods to a)
protect the identity of respondents; and b) make things easier for the data user.
Frankly, you can use American Factfinder to enter your home address, and get the
block-level population of persons on your block by race, by sex and by age. So then how is
the Census Bureau providing "disclosure avoidance" for standard products like
Summary File #1? If the Census Bureau had implemented rounding on standard census products
such as SF1, SF2, SF3, and SF4 then there would have been a riot among the data users,
Congress would have intervened, and the Census Bureau would be backtracking as fast as you
could say Appropriations Committee.
Right now we have two classes of Census Bureau products: "first class" products
such as the summary files and the Public Use Microdata Sample where there is (thank
goodness!) NO rounding at all. (There are data thresholds in SF2 and SF4, but that's
another matter.) The "second class" products are the CTPP and the EEO files,
where there is rounding of data to the nearest 10, 15, 20, etc. Perhaps it is the intent
of the Census Bureau to implement rounding in future releases of "regular"
Census Bureau products, such as American Community Survey and 2010 Census short form data.
That would be a big mistake.
The rounding of data in the CTPP guarantees loss of productivity: the data analyst will
lose productivity in terms of always second-guessing the data processing steps (is a tract
or zone missing? are there problems in my computer code?); and the data analyst will lose
time in explaining to data users: WHY THE NUMBERS DO NOT ADD UP!
Try explaining why: 10 + 10 + 10 + 10 = 50 !!!
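A minimal sketch of how this can happen, assuming each cell and each published total is rounded independently to the nearest multiple of 5. The helper round_to_base and the zone counts are hypothetical; the actual CTPP rounding rules may differ.

```python
def round_to_base(value, base=5):
    """Round a non-negative integer to the nearest multiple of base (halves round up)."""
    return base * ((value + base // 2) // base)

# Hypothetical unrounded counts for four zones (not real CTPP data).
true_cells = [12, 12, 12, 12]

# Each cell is rounded on its own, and the total is rounded on its own.
published_cells = [round_to_base(v) for v in true_cells]
published_total = round_to_base(sum(true_cells))

print(published_cells)       # the four cells a data user would see
print(sum(published_cells))  # what the user gets by adding them up
print(published_total)       # the separately rounded total
```

With these made-up inputs every published cell is 10, the cells add to 40, and the published total is 50.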
I have spent too much time over the past 20+ years explaining the difference between
commuters and "home-based work" trips; and "workers at work" and
"total employment." Now, we can be guaranteed to spend a heck of a lot more time
explaining "why don't the numbers add up?" (Does anybody have the home phone
numbers for Census Bureau management?)
Here's a real life example using the CTPP Part 2 data. Let's say my boss asks a
simple question: "How many transit commuters are at work in the Bay Area?" Using
the Part 2 data, I am able to provide my boss 15 different answers!
The short answer is "320 thousand."
The long answer:
In Table 2-2 (Means18) there are five categories of "transit" that need to be
summed to derive "total transit." In Table 2-12 (Means11) there are three categories
of "transit" that need to be summed; and in Table 2-27 (Means8) there are two
categories of transit that need to be summed to get total transit. (There are no
"Means5" tables in CTPP2 where "transit" is one, and only one, category.)
And there are multiple summary levels where one can derive a regional total count of
transit commuters, including TAZ, block group, tract, county and the "MPO Summary
Level". (Also, the county-place-remainder, the place-remainder-tract, and MSA/CMSA
summary levels can be used to extract more "different answers")
So, the following table illustrates the range of "regional transit commuters"
using the three available means-of-transportation tables, and five of the different
summary levels available in CTPP:
                     Table 2-2          Table 2-12
    N  SUMLEV    (Transit=5 cats)  (Transit=3 cats)  (Transit=2 cats)
4,031  TAZ           319,435           319,553
4,384  Blk Grp       319,433           319,521
1,403  Tract         319,717           319,780
    9  County        320,116           320,129
    1  MPO           320,125           320,120
What this tells us is that the number of "regional transit commuters" working
in the Bay Area is somewhere between 320,118 and 320,122, and it's rounded to 320,120.
All of the other numbers are subject to a modest degree of rounding error.
AND THERE IS A PATTERN!!! There is data "leakage" the more one aggregates from
lower levels of geography, and from a greater number of subcategories (e.g., aggregating
from the five transit sub-groups versus the two transit sub-groups.) This data leakage is
hardly statistically significant. It is, however, annoying.
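To put numbers on the leakage, here is a small sketch that subtracts each total printed in the table from the rounded regional figure of 320,120 quoted above. Only two value columns appear for each summary level, so they are kept as (first, second) without assuming anything further; the variable names are illustrative.

```python
# Regional transit-commuter totals as printed in the table above,
# two values per summary level, in the order they appear.
totals = {
    "TAZ":     (319_435, 319_553),
    "Blk Grp": (319_433, 319_521),
    "Tract":   (319_717, 319_780),
    "County":  (320_116, 320_129),
    "MPO":     (320_125, 320_120),
}

reference = 320_120  # rounded regional figure quoted in the text

# Positive numbers mean the aggregated total falls short of the reference.
shortfall = {
    level: tuple(reference - value for value in values)
    for level, values in totals.items()
}

for level, gaps in shortfall.items():
    print(f"{level:8s} {gaps[0]:5d} {gaps[1]:5d}")
```

The shortfall runs to several hundred commuters at the TAZ, block group and tract levels and drops to single digits at the county and MPO levels.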
My recommendation to users of CTPP data (Part 1 and Part 2):
1. Obtain your "regional control totals" or "state control totals"
from the most geographically aggregate summary levels, e.g., SUMLEV=040 for states, and
SUMLEV=930 for MPOs.
2. Avoid aggregating (summing together) your geographies whenever and wherever possible.
3. Avoid aggregating categories (e.g., detailed household income versus grouped
household income; means of transportation) whenever and wherever possible. For example, to
get the least affected count of 3-plus carpools, use tables based on Means of
Transportation (8 categories.)
4. Sum as few categories as possible to derive aggregated measures such as "total
transit." For "total transit" use CTPP Part 2, Table 27, where you are only
summing bus/trolleybus to streetcar/subway/railroad/ferry.
5. Adjust (de-round, un-round) as you see fit. Use SF3 or PUMS to provide control totals
to adjust the CTPP Part 1 data.
6. Develop a sense of humor. As I see it, this data rounding is a real joke. Don't
take these data issues too seriously. And it's kind of funny that the numbers
don't add up. Or, as they say: "close enough for government work."
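One way to read recommendation 5 is simple proportional scaling: multiply every rounded zone value by the ratio of an outside control total to the current sum. The sketch below assumes that reading. The function name scale_to_control, the zone values and the control total are all hypothetical, and the results are fractional rather than whole persons.

```python
def scale_to_control(values, control_total):
    """Scale values proportionally so that they sum to control_total."""
    current = sum(values)
    if current == 0:
        raise ValueError("cannot scale an all-zero series")
    factor = control_total / current
    return [v * factor for v in values]

# Hypothetical rounded zone counts and a hypothetical control total
# (for example, one taken from SF3 for the same universe).
rounded_zone_workers = [10, 15, 20, 55]
control_total = 104

adjusted = scale_to_control(rounded_zone_workers, control_total)
print(adjusted)
print(sum(adjusted))
```

Scaling keeps each zone's share of the total unchanged, which is why it is a common first choice, but it is only one of several possible adjustments.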
cheers and good luck,
Chuck Purvis, MTC
Question: Using the CTPP, 10 + 10 + 10 + 10 = ?
f) Any of the Above
>>> ed christopher <edc(a)berwyned.com> 02/12/04 12:01PM >>>
The rounding within the CTPP data can play heck with doing any data
analysis. In the Chicago Central Area there are 155 individual TAZs.
If you take a simple table from Part 2, say mode to work by sex, some
interesting things happen. If you sum the total workers using the
"total" field you get 631,999. This becomes an important number because
people like to know the total. However, when you sum all the modes by
zone you get 631,883. This is not a big deal except if you want to show
drive alone, carpool, transit and other with their modal share
percents. In this region, some of us like to see the actual numbers
along with the percents. Logic would say to use the 631,883 when
calculating the percentages but then that means the sum of the totals
(which we know to be the better number because row rounding was applied
to the tables) 631,999 gets tossed aside. One could get creative and
distribute the 116 workers in some weighted fashion which would not
likely affect any percentages but then the next guy who comes along
using the CTPP data and software would get different numbers and we are
back splitting hairs over who got what number from where.
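For anyone who does want to spread the 116 workers, a minimal sketch using a largest-remainder allocation weighted by mode is below. The per-mode counts are made up (chosen only so that they sum to 631,883) and the function name distribute is hypothetical. This is one possible weighting, not a recommended standard.

```python
def distribute(extra, weights):
    """Split `extra` whole units across keys in proportion to `weights`,
    giving leftover units to the largest remainders."""
    total = sum(weights.values())
    shares, remainders = {}, {}
    for key, weight in weights.items():
        # Integer arithmetic avoids floating-point surprises.
        shares[key], remainders[key] = divmod(weight * extra, total)
    leftover = extra - sum(shares.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for key in by_remainder[:leftover]:
        shares[key] += 1
    return shares

# Hypothetical mode sums; only their total (631,883) comes from the text.
mode_sums = {
    "drive alone": 250_000,
    "carpool": 60_000,
    "transit": 300_000,
    "other": 21_883,
}
sum_of_totals = 631_999  # sum of the "total" field, from the text

extra_by_mode = distribute(sum_of_totals - sum(mode_sums.values()), mode_sums)
adjusted = {mode: mode_sums[mode] + extra_by_mode[mode] for mode in mode_sums}

print(extra_by_mode)
print(sum(adjusted.values()))
```

The adjusted modes add back to 631,999, but as noted above, the next analyst using a different weighting would get slightly different mode counts.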
Are others finding the issue of rounded numbers a bit frustrating,
especially when it comes to aggregating TAZs?
I suppose one way to deal with this would be to simply round everything
to the nearest 100, even the 1980 and 1990 data, as it appears Chuck
Purvis, MTC, has done with his Commuting to Downtown trend analysis.