Using Bags For Privacy Purposes

Continued from BagSetImpedanceMismatchDiscussionTwo

Further, sometimes it's done for security/privacy reasons. You may want to supply a subset of employee info to an outside firm, but omit the employee number because it's not needed for the purpose of the request and a possible privacy violation.

Newspapers are asking for such info of government agencies of late to see if gov't employees are over-paid because Rush Limbaugh etc. are making it a political issue. (Rush and the subject of "bags", how fitting.) A newspaper doesn't need to see the employee number. The data supplier could generate a dummy key as it's being "saved", but it would be an unstable key and lead us to the same problem. The only full solution I see would be to make a surrogate key that "stays with" a given employee for the duration of the record in the employee master table. But that puts an extra burden on the data supplier. It's another key and index that has to be carried around. It's a lot of work just to satisfy the Set Gods.

Further, the newspaper could use it to make inferences about promotion patterns because the key makes an entity instance track-able over time, which is not the stated goal of sharing the info. The gov't agency, employees, and unions will want to supply only the bare minimum info necessary to satisfy the request. A stable keys goes outside of that. -t

  Salary_Info table
True, the paper may be able to also make promotion inferences using a guess-a-tron to match prior data sets with say 90%+ accuracy, but still nobody will want to give them more than absolutely necessary.

Using a true relational system like the RelProject, the above can be easily generated as output from a relation or imported/linked and a generated key produced to input it as a relation. There's nothing here to indicate why we'd want to actually keep it as a bag. You haven't indicated, anywhere on this Wiki, why the RelProject -- or any other true relational system -- should need, or even want, to manipulate bags beyond being able to import bags into relations and export relations into bags in order to trivially accommodate the occasional flawed external system.

Yes I have, you just ignore them and dismiss economic models for reasons that escape me. Our tools are to do work and deliver product to customers, not make ivory-tower zealots happy. You have a "purpose" problem. Rel is unnecessarily complex in having arrays being separate from tables, etc. It's type-happy. And it's not just about external systems: a query language that wants to challenge SQL should be able to be hooked into an existing RDBMS's engine without having to rewrite existing schemas and views. Being bag-phobic makes this more complicated.

No, what you've shown is only that bags may need to be imported or exported. You've not shown that they need to be maintained or manipulated as bags. And why would they need to be? If you import a CSV file into a DBMS, do you need to keep it as a CSV file in the DBMS?

It's been explained to you multiple times that ARRAY is closer to SQL's CURSOR construct than any persistent structure like a table or RelVar. It is disengenuous to persist in using deliberate "misunderstandings" to try to make a point. A query language that challenges SQL needs to address SqlFlaws. The RelProject does so, and people are using it for practical purposes; no existing schemas or views need be rewritten to link to external SQL DBMSs. At worst, the occasional keyword is needed to identify how external systems should be interfaced. That seems more reasonable than risking incorrect query results in order to avoid the occasional keyword, doesn't it?

I've given a suggestion that would greatly reduce accidental primary key dropping without outright forbidding bags. Your position is an extreme one. And you conveniently ignored the risk of a temporary key being mistaken for a permanent one. Pretending like that risk doesn't exist doesn't make it non-existence. Your bias is showing.

I showed how temporary keys can be trivially excluded in reporting tools, etc., by making them all derived from an UNSTABLE type. This is a non-issue. Furthermore, you've conveniently ignored my inescapable point: "At worst, the occasional keyword is needed to identify how external systems should be interfaced. That seems more reasonable than risking incorrect query results in order to avoid the occasional keyword, doesn't it?" Finally, you haven't dealt with the fact that allowing both bags and relations, in parallel, in a DBMS increases complexity. Complexity, unless definitively necessary, should obviously be avoided.

Who says it's a "report"? Why should we distinguish between one transfer/communication method and another? Why are RDBMS to be "blessed" as holy, but not other techniques? The journalists may load the CSV data into their RDBMS. We probably won't know; we just deliver what was agreed to. And how is UNSTABLE different enough from a keyword to matter? And a person using data not from Rel won't know it has an "unstable" key. Rel is a WalledGarden.

{If they can't answer the question about what they want as a csv file, that is their problem - you ship them a basic CSV file that you can easily create, and if that isn't good for them, they will complain and ask you for a different CSV file, or they will modify the csv data themselves. If they want to customize or change the data to a slightly different format, they can do so. I've dealt with CSV files before that aren't perfect; so I change the column names, fiddle with the data types in the schema that they shipped me, etc. That's another thing - are they shipping a schema with the csv file or just plain csv data with no schema? A lot of important details can be lost without a schema telling you information about the database constraints. CSV files aren't perfect, people can modify the data layout to their needs. }

But "give them what they want and only what they want" would apply to the database side also. If you are a data programmer why should you put in extra info that nobody asked for? If you are out to "fix the world" and inject temporary keys to "make the universe right", wouldn't that also apply to CSV's? If you wish to argue with users and technicians and aggressively insist on "no bags", then to be consistent you should evangelize against bags in all forms.

{The CSV file format doesn't even dictate that you need column names in the file - aren't column names useful in databases? You can optionally put column names at the top of the csv file and your database can import those column names. Are you therefore saying that databases shouldn't even have column names, because the csv file might not have column names so why should databases have column names? Basing your database on csv is bass ackwards. Csv is a physical flexible file format, much too general purpose. Are column names just nanny state dictating your columns? Should columns always be name-less? CSV format could be extended to allow a schema inside the csv file, but since the schema can be shipped separately it might be better to keep csv the way it is.}

If columns names are left off, hopefully it's done for a good reason. Any tool can be misused. Some database tools will still import non-column-named CSV files if I'm not mistaken. They use dummy names such as "field01, field02, field03, etc.". It's a good way to get something up and going in short notice. One can fix the column names later. For example, suppose you only need 4 out of 30 columns in the imported table for a project with a big deadline. You could use the tool to import it using the "field01..." convention, rename the 4 fields of near-term interest, finish your project, and adjust the other 26 column names during a less hurried time. This is called flexibility: LimpVersusDie. Of course such usage depends on the domain. I wouldn't want too much of it for say medical or space applications. It's better to be anal-to-the-max in those domains, even if it takes more time and resources. -t

{But often business apps need to be anal too - for example a bank wants to have correct transactions. A police station bloody well better be correct about the crimes it is documenting - one mistake in the database and that could put a person in jail for a crime he didn't do. It could be as simple as a boolean being set to "Tru" instead of "True" due to a spelling mistake that puts a person in jail for life.. if the database is just strings or blobs, "tru" is an acceptable cell value, which could throw off the entire logic of a query. With an AnalBucket that only accepts TRUE or FALSE and not "tru" or "fall", the app has more integrity. The idea that business apps need not be reliable or "anal" is nonsense.}


I developed business applications for 25 years. I never encountered a circumstance where "correct" wasn't a top priority. If "correct" didn't matter, they wouldn't have bothered obtaining data in the first place, let alone expended the resources to analyse it.

{DevilsAdvocate and OffTopic: The only time I've found "correct" becomes murky (correct not a top priority), is with graphics or gui. Sometimes I've sat around debating for hours with myself whether or not to use a gray OK button, or a black one. There is no correct artistic quality to an application. To avoid this problem, I settle on defaults.. I use the default operating system colors instead of my own.}

Indeed, HCI issues may be murky, but the issue here is correct vs incorrect data. It is not murky at all. The simple fact is that BagAtational systems like SQL permit errors due to duplicate rows. Those errors simply cannot occur in true relational systems. However, true relational systems impose no more additional effort over BagAtational systems than adding a keyword or two to correctly import or export bags when these are (occasionally) required. The advantage, in all conceivable respects, clearly lies with true relational systems. The notion that adding a keyword or two (in rare conditions!) represents a significant economic cost, or that generated keys escaping into the wild represent a significant business risk, is ludicrous.

What language has such a feature? Rel didn't the last time we came this issue. I smell exaggeration. Going through extra steps to get a bag when needed can create errors. More steps is often more errors. And you haven't shown that mistaking temporary keys for stable keys is sufficiently rare. You only claim it. Why should I or the reader "just believe it" because you claim it over and over repeatedly in multiple spots redundantly? And you ignore the warning technique I suggested, comparing everything to SQL's way instead.

A version of Rel, currently in testing and tweaking and to be released in a few weeks, has high-level facilities import/link bags and add a generated key via TutorialDee extensions. It's always had low-level facilities to import/link bags as RelVars, and it's always had high-level constructs to emit bags.

You haven't shown that mistaking temporary keys for stable keys would ever occur. I doubt it would.

Your so-called "warning technique" is not even worth comment.

     If most bags are accidental Then
         warning system will greatly help
     Else // most are intentional
         Rel has features to do what SQL does for existing RDBMS
     End If
       [ bag -> relation -> bag ] 
       bag -> [ relation ] -> bag
Now: How, really, would "going through extra steps" to create a bag "create errors"? This is getting ridiculous. Surely you're just quibbling to preserve some idealistic adherence to bags for the sake of "bag purity", or some such, aren't you?

All else being equal, the more programming/query code needed to carry out a task, the more chance for making an error. Perhaps I focus on "productivity purity", eh? -t

So you'd rather allow the developer to frequently create inadvertent and subtle errors that could be pervasive throughout the database, than entirely eliminate those and allow the developer to rarely create explicit and obvious errors only at the boundaries of the system?

You haven't demonstrated that bag errors are more common than other kinds of errors that the tradeoffs bring. Frequency and magnitude of error is indeed important, but your assessment of both differs from mine, especially if the warning system is put in place. You need to flesh out why you think X is more common than Y. I'm not going to just take your word for it, you need to present an explanation for frequency and impact assessments. I cannot read your mind. Repetition of your summary assessment is not sufficient.

In most production databases, are the bulk of tables and queries referring to tables in the database, or outside the database?

"The"? A good many of the systems/processes I work on talk to and connect to a variety of databases and sources; some I can control schema/layouts and some I can't. Thus, tools that work effectively with a variety of databases and data-sources in which we the coder/querier have a varying range of control over are a good thing. A fair amount of unchangable sources are going to be bags either because the schema has bags or the info we can access omits the primary key or info about the primary key.

You haven't answered the question.

I guess I'm not understanding your question then.

You don't have to understand it, just answer it.

The answer is 53.

I can only interpret such an evasive response as clear evidence that you do understand the question and you know the point I intend to make about it, but are unable to provide a cogent counter-argument. In other words, you know that in the vast majority of databases, most queries refer to content within the database rather than external data sources.You also know it is reasonable to assume that errors will be proportional to the relative frequencies of types of query, i.e., errors from queries that reference internal data will contribute more to the overall error count than errors from queries that reference external data. (We'll ignore for now the fact that in this case the former are more subtle and difficult to find than the latter.) Therefore, if there is a mechanism that prevents errors in internal queries, at the expense of introducing errors in external queries, it makes sense to implement the mechanism because it will reduce the overall number of errors.

Personally, I think this is either quite weakly stated, or else I outright disagree. I tried once, to discourage the complacent notion that these 'generated keys' are a mechanism that 'prevents errors'. I'd prefer something like 'makes them easier to fix'. In which case, there would be alternatives, conceivably. I'm not against these generated keys, but you're arguing me out of my complacency about them. I don't take them to be a panacea. That they prevent errors, is not the way, exactly, to put the case for 'generated keys', in my view. Refusing to allow duplicate rows without addressing the situation somehow, should be an easy point--really, why bother to argue for relational theory? And, note, that 'why bother to argue for relational theory' is a little different than 'I want to prevent errors, and so etc.' Relational theory is technical. The arguments for it, do not look like this page, they're mathematical. I'm tempted to say something like don't argue for it, just implement it.

Generated keys are a mechanism that prevents errors by allowing the RelationalAlgebra to be used, instead of requiring some SQLesque hybrid relational/bag algebra. "Makes them easier to fix" does not accurately reflect the issue, because it is truly the case that a category of errors that are possible in SQL cannot occur in a faithful implementation of the RelationalModel. I.e., there's nothing to be "easier to fix" if you can't make a mistake in the first place.

To express my disatisfaction, I'll repost this quote from further above: 'Anyway, let me explain how SQL and Rel conceptually differ: In SQL, the input to a query can be a bag, so a certain category of errors can result. In Rel, the input to a query cannot be a bag, so a certain category of errors cannot result. In all other ways, Rel and SQL can accomplish exactly the same things.' I'd prefer to just leave it at 'In Rel, the input to a query cannot be a bag'. Then, we could focus on how messy that statement already is. What is 'the input to a query'? Query languages allow manipulation and retrieval of data. I'd prefer your point to be, that you're interested in a simple, powerful QL, that has a strong formal foundation based on logic. That's what you want to support. And it will be something, that is not intended to be used for complex calculations, is not 'turing complete', it will support easy, efficient access to large data sets. It will allow for much optimization.

I don't think there's any difficulty understanding, informally, what "input to a query" is.

That's not enough? The focus, for me, should be on what forms the basis for SQL. Relational algebra, and relational calculus. Now, if a query is applied to relation instances, then the result of a query is also a relation instance. I'd put somewhere along about here, something about how the schema for the result of a given query is fixed. Determined. By what? By definition of query language constructs. Relational algebra has some basic operations. It also has, I suppose, additional operations. But each operation returns a relation. That means, that the algebra is closed. And that means, that operations can be composed. Which all sounds quite different, doesn't it, to 'In Rel, the input to a query cannot be a bag, so a certain category of errors cannot result.' Not, I think, actually, the point, I can't wring all this happy crap I'm talkin' about, out of that.

Actually, neither RelationalAlgebra nor RelationalCalculus form a basis for SQL, which is precisely the problem. They are arguably both an inspiration for SQL, but it deviates heavily. SQL allows duplicate rows, for example. However, the focus on this page is not on adherence to the RelationalModel per se. You appear to be suggesting that it is a goal in itself, and that it can be defended mathematically. Be that as it may, it doesn't say -- in pragmatic terms -- why the RelationalModel is a GoodThing. On this page, we've been debating it from a practical point of view. In short, not using the RelationalAlgebra means using a bag-based algebra, which permits errors due to duplicate rows. (Of course, that's essentially true by definition!) SQL is notorious for this, and I gave an example on BagSetImpedanceMismatchDiscussion.

The happy crap that I'm talkin' about, is relational algebra. These are just preliminaries. What a query Is Applied To, that's better than 'the input to a query etc.'. What is a query applied to? schemas of input relations--that's the 'input', perhaps..I dunno, maybe I'm just being a bit more patient than you, slowing down. Was this what you were trying to explain? And what about schemas of input relations for a query, something about them being 'fixed'? And the result of such a query is a relation instance, which is to say, the result of such a query is *also* a relation instance? So? So the schema for the result of a given query is also fixed. Oh yeah, well then what is it determined by? By, see above, it is determined by definition of query language constructs. I'm sure I'm wrong to smugly feel, that this probably sounds to everybody else like Martian to you people, jeez, this is the subject you're debating. These are preliminaries. Before we get into Selection, so certainly before we get into union, intersection, set-difference, we'd get into projection, which I haven't seen this word used here? Because, of course, we all know about projection. So of course, incidentally, we know why projection operator has to eliminate duplicates. I think top is crying for help, he wants to understand this arcane priesthood wisdom, and either you don't have the patience, or it's just actually too technical for all of us.

Before we get into projection, we must understand relational algebra. In any case, the relational model has *rigorously defined* query languages. That point, is simply part of a summary of relational algebra. I don't think I've cleared anything up one bit, have I? :-)

I'm sure Top is well aware of the technical aspects of the RelationalAlgebra. Why it is superior to some bag-based algebra is the issue here, and I argue here that it is superior -- among other things discussed elsewhere -- because it categorically precludes errors due to duplicate rows.
{This is a case where backwards compatibility is evil. Connecting to a variety of databases and remaining compatible with them is a goal of mega backwards compatibility and mega flexibility. Surely there is a pattern describing the evils of backwards compatibility on this wiki somewhere? It goes something like this: we need to remain compatible with ms-dos so let's build windows 98 on top of ms-dos instead of building a proper operating system that replaces ms-dos. Uh oh! How about building a database that remains backwards compatible with ExBase, but also runs on Dos... and truly relational is just a PipeDream because backwards compatibility with SQL is king. Or remaining backwards compatible with old BASIC code which used lots of GoTo and line numbers.. At some point you have to abandon backward compatibility in favor of a new more sterile technology. Luckily CSV offers a way to remain still somewhat backwards compatible since CSV is just dumb text.}

This is a matter of weighing trade-offs. I've spent a good amount of time revising stuff to fit new languages/paradigms/tools/interface. It's costly economics-wise and shouldn't readily be dismissed. And in my opinion Microsoft was wrong to discontinue the VB6 language so suddenly. They could have offered both for a while so existing apps would still have some support. In other words, phase it out gradually rather than yank the rug out all at once.
Figure 4729:

Query language can send or receive a bag to/from a DBMS engine that:

 Query.|.DBMS that...|.DBMS that DOESN'T
 Lang..|.allows bags.|.allow bags 

(Dots to prevent TabMunging)


EditText of this page (last edited March 24, 2012) or FindPage with title or text search