Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 05, 2005 @01:27PM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

Some statistics to get you started (Score:5, Funny)

by Anonymous Coward writes: on Monday December 05, 2005 @01:28PM (#14186064)
I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code.

The following "interesting statistics" come to mind:
- Percentage of functions named "deepThroat" (0%)
- Number of comments mentioning a "girlfriend" (11) or "wife" (29) to "Natalie Portman" (41)
- How many variables named "penis" are of type "long" versus type "short" (unknowable!)
You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax [apache.org], so the statistics for "Natalie Portman" may include references to "portman."
useful statistic (Score:5, Funny)

by kunzy ( 880730 ) writes: on Monday December 05, 2005 @01:30PM (#14186081) Homepage

the time from the frontpage article on /. to the death of your server?

- Re:useful statistic (Score:5, Funny)
  
  by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @01:33PM (#14186109) Homepage
  
  Well, it's been about 2 minutes on slashdot... my site is already dead. So uhm... 2 minutes?
  
  - Re:useful statistic (Score:2)
    
    by Guysmiley777 ( 880063 ) writes:
    
    Mmmm, it smells like burning.
  - Re:useful statistic (Score:5, Funny)
    
    by Baricom ( 763970 ) writes: on Monday December 05, 2005 @02:11PM (#14186463)
    
    So uhm... 2 minutes?
    
    Sounds like you should have written it in C++ instead of a laggard language like PHP ;).
    
    - - Re:useful statistic: parent: -1 troll (Score:4, Funny)
        
        by Baricom ( 763970 ) writes: on Monday December 05, 2005 @05:12PM (#14188217)
        
        That "woosh" sound you hear is the wink emoticon zooming over your head, joke in tow.
        
        I know PHP is a great web language and that it probably isn't the cause of the slowdown. Heck, even Yahoo! uses it these days.
        
        I was attempting (unsuccessfully, it seems) to make fun of the purists who insist that robust web applications must run on something compiled in order to reach acceptable performance under high load.
        
My vote is for... (Score:5, Insightful)

by Anonymous Coward writes: on Monday December 05, 2005 @01:31PM (#14186090)

How many lines consist of:
}

- Re:My vote is for... (Score:2, Funny)
  
  by Anonymous Coward writes:
  
  Probably about as many lines consist of: {
  - Re:My vote is for... (Score:2, Insightful)
    
    by Triple Click ( 898568 ) writes:
    
    Depends whether you do this:
    
    if (cond) {
    }
    
    or this:
    
    if (cond)
    {
    
    }
    - Re:My vote is for... (Score:5, Interesting)
      
      by baadger ( 764884 ) writes: on Monday December 05, 2005 @03:57PM (#14187459)
      
      There's an idea right there: how about some stats showing the popularity of various coding conventions?
      
      Variables: under_score vs. camelCase
      
      Tabs vs. spaces
      
      "if (cond) {" vs. "if (cond)\n{"
      
      How many coders bother enclosing single conditionally executed statements with {}
      
      How many coders bother producing comments directly before or after function definitions, describing function implementation?
      
      Lines of comments to lines of code ratios
      
      Number of functions to lines of code ratios for various projects?
      
      Number of projects making use of global variables?
      
      C, to C++, to C# (if your engine covers it) project ratio
      
      etc
      
- Re:My vote is for... (Score:5, Interesting)
  
  by epiphani ( 254981 ) writes: <epiphani&dal,net> on Monday December 05, 2005 @01:43PM (#14186223)
  
  Same type of thing, but indenting styles. K&R vs. BSD, etc. I'm curious how that breaks up.
  
  (Partial to BSD style myself..)
  
  - Re:My vote is for... (Score:2)
    
    by sparkes ( 125299 ) writes:
    
    That would be interesting but a bugger to search for
    - Re:My vote is for... (Score:5, Funny)
      
      by mebollocks ( 798866 ) writes: on Monday December 05, 2005 @02:33PM (#14186651) Homepage
      
      I dunno, maybe you could find the algorithm on the net somewhere? ...if only there was some kinda searchable code database of some sort...
      
- or "// FIXME" (Score:5, Funny)
  
  by StandardDeviant ( 122674 ) writes: on Monday December 05, 2005 @02:20PM (#14186538) Homepage Journal
  
  (subject says it all ;))
  
- - Re:My vote is for... (Score:2)
    
    by Erioll ( 229536 ) writes:
    
    Most closing-statements require no semicolon. While things like class definitions, structs, etc, DO, "typical" programming blocks do NOT, like if, while, and switch blocks. Even functions don't terminate their blocks with a semicolon.
    
    So I'd suspect lines with purely "}" and whitespace would be quite a few.
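The brace-only line count discussed in this subthread could be computed along these lines. This is only a minimal sketch: it assumes the source text is already in memory, and `countBraceOnlyLines` is a hypothetical helper, not something the search engine is known to provide.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: counts lines that contain nothing but the given
// brace character plus optional surrounding whitespace.
std::size_t countBraceOnlyLines(const std::string& source, char brace) {
    assert(brace == '{' || brace == '}');  // only meant for braces
    std::istringstream in(source);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        const auto first = line.find_first_not_of(" \t\r");
        const auto last = line.find_last_not_of(" \t\r");
        // A match has exactly one non-whitespace character: the brace.
        if (first != std::string::npos && first == last && line[first] == brace) {
            ++count;
        }
    }
    return count;
}
```

Comparing the counts for `'{'` and `'}'` over the same files would also give a rough signal for the brace-placement question raised above, since the "if (cond) {" style produces far fewer lines holding only an opening brace.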
Similarity checking (Score:5, Funny)

by roguerez ( 319598 ) writes: on Monday December 05, 2005 @01:31PM (#14186093) Homepage

Find similarities with stuff like SCO.

- Or similarities between different projects (Score:2)
  
  by Jamie Lokier ( 104820 ) writes:
  
  Including those with incompatible licenses.
  
  Related: having found similar code sections, follow trends in them over time. Find where two programs copied the same code, but one has failed to implement what might be a bug fix or improvement in another, by looking at changes to the code over time.
Interesting stats (Score:5, Interesting)

by sparkes ( 125299 ) writes: on Monday December 05, 2005 @01:32PM (#14186097) Homepage Journal

How many lines contain expletives?

- Re:Interesting stats (Score:5, Informative)
  
  by moosesocks ( 264553 ) writes: on Monday December 05, 2005 @03:05PM (#14186951) Homepage
  
  How many lines contain expletives?
  
  for your reading pleasure [vidarholen.net].... the linux kernel fuck count
  
SCO (Score:2, Funny)

by cmburns69 ( 169686 ) writes:

With all that code indexed, maybe we'll finally be able to figure out what the heck SCO's talking about.

But then again, probably not...
One word (Score:2)

by OverlordQ ( 264228 ) writes:

. . . well program, sloccount [dwheeler.com]. Of course, do some research and tweak the parameters to get a reasonably accurate result though.
Statistics: (Score:5, Interesting)

by duckpoopy ( 585203 ) writes: on Monday December 05, 2005 @01:32PM (#14186104) Journal

1. Lines per function
2. Comment / command ratio
3. Number of curse word variable names

- Re:Statistics: (Score:2, Insightful)
  
  by gronofer ( 838299 ) writes:
  
  4. The number of times the wheel has been reinvented.
- Re:Statistics: (Score:3, Informative)
  
  by Anonymous Coward writes:
  
  From the stats page if you cannot get to it...
  
  Overall Stats
  Number of Packages: 10,931
  Total Number of Files: 1,151,819
  Total Lines of Code (No comments, no blank lines): 283,119,081
  Total of All Lines: 420,355,464
  Total Number of Functions: 7,782,468
  Total Number of Functions Called: 69,500,700
  Total Number of Macros: 9,947,564
  Total Number of Classes: 209,361
  Total Number of Comments: 38,125,107
  Total Number of Structures: 5
  - Re:Statistics: (Score:3, Funny)
    
    by maxwell demon ( 590494 ) writes:
    
    Total Number of Functions: 7,782,468
    Total Number of Functions Called: 69,500,700
    
    So the code calls 61,718,232 functions which don't even exist?
    
    But maybe they just meant "Total Number of Function Calls" :-)
- Measurements I have made (Score:5, Insightful)
  
  by derek_farn ( 689539 ) writes: <derek&knosof,co,uk> on Monday December 05, 2005 @01:47PM (#14186262) Homepage
  
  Source code usage measurements contain many surprises (i.e., developers don't always write what people think they do). Some statistics I have collected, on a smaller code base, are available here [coding-guidelines.com]. The source of the tools used to extract much of the data (at least for those tables and figures I produced) is available here [knosof.co.uk] (C only at the moment).
  Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of Bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).
  Keep up the good work!
  
- Need to watch those stats (Score:3, Funny)
  
  by Quiet_Desperation ( 858215 ) writes:
  
  For example, "Lines of code" / "Lines of commenting" will always produce "Inf"
- Re:Statistics: (Score:2)
  
  by dkleinsc ( 563838 ) writes:
  
  Some other real suggestions of useful statistics:
  1. Maximum brace nesting level for each function (might be difficult, but a good metric for determining the complexity of a function)
  2. Percentages of control structures that are while, for, switch, if, etc.
  3. Number of embedded constants that aren't 0 or 1
  4. Count of references to each function/constant within a single project
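The first suggestion in the list above (maximum brace nesting level) can be approximated without a full parser. The sketch below is a simplification under stated assumptions: `maxBraceDepth` is a hypothetical helper, and it counts every brace it sees, including braces inside string literals and comments, which a syntax-aware tool would skip.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: returns the deepest level of `{ ... }` nesting found
// in a snippet of code. Braces in literals and comments are not filtered out.
int maxBraceDepth(const std::string& code) {
    int depth = 0;
    int deepest = 0;
    for (char c : code) {
        if (c == '{') {
            ++depth;
            deepest = std::max(deepest, depth);
        } else if (c == '}' && depth > 0) {
            --depth;  // unmatched closing braces are ignored
        }
    }
    assert(depth >= 0);  // invariant: depth never drops below zero
    return deepest;
}
```

Fed the body of a single function, the result is the function's nesting level; the function's own outer braces account for the first level.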
- Re:Statistics: (Score:2)
  
  by bob_jordan ( 39836 ) writes:
  
  4. How many lines belong to SCO.
  5. ?
  6. Profit
  
  Bob.
  
  (where 5 is a pretty good chance of getting counter-sued out of existence by IBM when the answer is some { and a few less }.)
- - Re:Statistics: (Score:2)
    
    by AdamWeeden ( 678591 ) writes:
    
    Well if we compile them in Windows, all of them. ;)
ratio (Score:5, Funny)

by FreeBSDbigot ( 162899 ) writes: on Monday December 05, 2005 @01:33PM (#14186106)

... of "foo" to "bar."

- Re:ratio (Score:5, Funny)
  
  by ahem ( 174666 ) writes: on Monday December 05, 2005 @03:24PM (#14187127) Homepage Journal
  
  From google:
  
  Search -- foo -> Results 1 - 10 of about 26,600,000 for foo. (0.06 seconds)
  Search -- bar -> Results 1 - 10 of about 385,000,000 for bar [definition]. (0.16 seconds)
  Search -- foo bar -> Results 1 - 10 of about 7,900,000 for foo bar. (0.12 seconds)
  
  'bar' wins. This intuitively makes sense, as who would want to go to the 'foo' for a drink, or eat an 'energy foo'? Could you imagine a lawyer being 'dis-fooed'?
  
Suggestion (Score:5, Funny)

by lbmouse ( 473316 ) writes: on Monday December 05, 2005 @01:33PM (#14186120) Homepage

"I'm currently looking for suggestions..."

How about a new server?

Slashdot Block (Score:3, Interesting)

by Yerase ( 636636 ) writes: <randall...hand@@@gmail...com> on Monday December 05, 2005 @01:34PM (#14186125) Homepage

I love the GeSHi page, how it blocks everything from Slashdot. Set up a site to advertise a product, then restrict people from using it....
URLs on this server linked by slashdot.org will be refused. Permission is given to slashdot to mirror content as necessary for the purpose of providing its users access to the information on the site. Slashdot should not attempt to bypass the referer block. Use of the google cache page for the site is acceptable as long as the page(s) concerned have no more than 1 image.

- Re:Slashdot Block (Score:3, Insightful)
  
  by lowrydr310 ( 830514 ) writes:
  
  This policy is employed for the sole purpose of avoiding a huge bandwidth bill that I would have to pay out of my own pocket. Anyone who would like this restriction to go away is more than welcome to send me bucketloads of cash.
  If you don't want to pay a big bandwidth bill then don't run a webserver.
  - Re:Slashdot Block (Score:2, Insightful)
    
    by b4k3d b34nz ( 900066 ) writes:
    
    Why would anybody WANT to pay a big bandwidth bill? It's called being smart so that he doesn't get the shaft when he has to pay his utilities this month.
  - Re:Slashdot Block (Score:2)
    
    by Anonymous Brave Guy ( 457657 ) writes:
    
    If you don't want to pay a big bandwidth bill then don't run a webserver.
    
    If you want access to a web server, don't run a system that's known to give the provider big bandwidth bills.
    
    At the end of the day, they don't owe you anything, and anything they offer you is a courtesy, not an obligation. If you don't like that, please feel free to go create and finance your own WWW.
  - Re:Slashdot Block (Score:3, Insightful)
    
    by gstoddart ( 321705 ) writes:
    
    This policy is employed for the sole purpose of avoiding a huge bandwidth bill that I would have to pay out of my own pocket. Anyone who would like this restriction to go away is more than welcome to send me bucketloads of cash.
    
    If you don't want to pay a big bandwidth bill then don't run a webserver.
    That's a little harsh don't you think?
    
    It's one thing to run a site and have reasonable expectations of having "enough" bandwidth for your projected traffic, and it's another thing to pay for a slashdotting on a
  - Re:Slashdot Block (Score:3, Informative)
    
    by Kjella ( 173770 ) writes:
    
    If you don't want to pay a big bandwidth bill then don't run a webserver.
    
    For every problem, there is a solution that is simple, elegant and wrong. In every other market, the more demand there is, the higher the price/revenue/profit. Web servers are pretty much the only place where you lose more money the more popular you are (e-commerce sites and such not included). If so many people want the content, they can find a way to share it. Even then they're getting a bloody good deal, if you ask me. What exactly
- Hit Refresh (Score:5, Informative)
  
  by everphilski ( 877346 ) writes: on Monday December 05, 2005 @01:44PM (#14186229) Journal
  
  Just hit refresh and the webserver won't get the HTTP_REFERRER (granted you'll have to manually delete the text file he serves you)
  
  -everphilski-
  
  - Re:Hit Refresh (Score:3, Funny)
    
    by sglane81 ( 230749 ) writes:
    
    Actually, if you click refresh on a page from a link, it will resend the referrer as well. Most browsers do this. One more thing, you spelled HTTP_REFERRER correctly, which is wrong :) It's spelled HTTP_REFERER, only has one R. Reverse grammar nazi FTW?
- Re:Slashdot Block (Score:2, Interesting)
  
  by wampus ( 1932 ) writes:
  
  That's why I use Cacheout [thetechgurus.net]. It's a Firefox extension that adds a context menu item to coralize any link. Bypass the restriction AND not kill the site, all at the same time.
Choice of db? (Score:4, Interesting)

by Anonymous Coward writes: on Monday December 05, 2005 @01:35PM (#14186137)

So, this is not a flame, but I'm curious about your choice of dbs.
I've used mysql for some small projects, but generally it doesn't handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.

How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?

Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).

- Re:Choice of db? (Score:4, Informative)
  
  by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @04:18PM (#14187649) Homepage
  
  I've used MySQL in the past for some projects at work, where the number of rows were several hundred million and ran with no problems so I knew it was capable of large row numbers.
  
  I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this)
  
  So I had to hand off searching to Lucene, which worried me a great deal (being java) but as folks tell me 'Java is not slow'.
  They are right, Java is very fast at handling the searching and I've been very impressed.
  Most searches in the Java database only take one or two seconds.
  The MySQL query/join for additional info take another 4 or 5 seconds.
  
  Most searches take about 8 seconds to come up, even under no load.
  
  I simply don't have enough RAM to keep the necessary MySQL indexes in RAM and use index only queries.
  
Statistics TM (c) (Score:5, Interesting)

by chunews ( 924590 ) writes: on Monday December 05, 2005 @01:38PM (#14186162)

It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL, GPL2, etc.
Also, I would really like to find "patient 0" for source code. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.? Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)

Interesting Statistics (Score:5, Interesting)

by iso-cop ( 555637 ) writes: on Monday December 05, 2005 @01:39PM (#14186174)

In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.

Amazon style statistics (Score:5, Interesting)

by tod_miller ( 792541 ) writes: on Monday December 05, 2005 @01:39PM (#14186180) Journal

I was very impressed with Amazon, who for each book say which phrases and words are particularly unique to that book. (Reminds me of that Google game where you try to get any two words with only 1 hit.)

So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.

I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).

word. Oh some adhesion stats would rock!

please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org

The basics and more (Score:2, Insightful)

by PetriBORG ( 518266 ) writes:
Start with the basics, and then move on..
1. Whitespace to code ratio
2. Counts for each of the dirty 7
3. Line counts that just contained () or {} or []
4. A list of projects the code is from
5. And then more interestingly, I'd like to run some sort of program on it to find similarities in code, to see how much one code base overlaps with another. It would be interesting to see if OSS actually does share code between projects or if it's all NIH (not invented here).
interesting stat (Score:3, Funny)

by bsdluvr ( 932942 ) writes: on Monday December 05, 2005 @01:45PM (#14186241) Homepage

1) randomly select 2000 lines of code
2) compile
3) execute
4) ???????
5) PROFIT!

Woman (Score:2, Funny)

by chris_mahan ( 256577 ) writes:

I'd like to know whether the word "woman" appears anywhere, and if so, in what projects.

Eh.
Unfortunately (Score:2)

by aztektum ( 170569 ) writes:

All the code was just /.'ed into oblivion. Time to start from the beginning all over again. :(
Sounds kind of like the PMD scoreboard... (Score:5, Interesting)

by tcopeland ( 32225 ) * writes: <tom AT thomasleecopeland DOT com> on Monday December 05, 2005 @01:48PM (#14186268) Homepage

...that is, a static analysis of a bunch of Java SourceForge projects [sourceforge.net]. It does unused code and duplicate code detection... sometimes it finds some interesting things.

PMD home page is here [sf.net], book site is here [pmdapplied.com].

cout "why bother" (Score:2)

by micromuncher ( 171881 ) writes:

I'm curious, when people are looking for code, what do they do as a first resort? Maybe this should be a poll. Me, I'm a bit funny...
1) look in my library (books)
2) do a deja search
3) ask smarter people than me
4) do a web search (usually on specific sites)
Find all buffer overflows please (Score:2)

by G4from128k ( 686170 ) writes:

I can only hope that this database has good metadata on which code fragments contain/don't contain various common species of exploits (buffer overflow, stack overflow, mal-formed input vulnerabilities, etc.). It would be nice to know which code fragments have all the needed input/size checking needed to be safe for exposure to the outside world and which are "for internal use only."
Not working well -- TRY AGAIN LATER (Score:2)

by putko ( 753330 ) writes:

It is hosed.

I tried searching. Here's what I got:

XML Parsing Error: junk after document element
Location: http://csourcesearch.net/performSearch.php?type=FunctionTypeReturned&search=(&ignoredRandomNumber=1133805159922.7798 [csourcesearch.net]
Line Number 2, Column 1:
Warning: mysql_connect() [function.mysql-connect [slashdot.org]]: Can't connect to MySQL server on '127.0.0.1' (4) in /home/csourcesearch.net/include/php/GraphXML.php on line 309
^
Please check for this: comma in brackets in C++ (Score:5, Interesting)

by Animats ( 122034 ) writes: on Monday December 05, 2005 @01:58PM (#14186354) Homepage

C++, for historical reasons dating back to C, has weird semantics for commas in brackets. Commas are treated differently inside "()" and "[]".
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a subscript with one argument. The default "comma operator" evaluates and discards its first operand and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
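The behaviour described in this comment can be seen in a small sketch. The function names are made up for illustration, and the subscript is written with an extra pair of parentheses on purpose: as far as I know, newer language revisions (C++20 and C++23) deprecate and then change the meaning of an unparenthesized comma inside `[]`, so the parenthesized form is the portable way to show the old semantics.

```cpp
#include <cassert>

// With built-in types, `a, b` evaluates `a`, discards it, and yields `b`.
// The cast to void only silences "unused value" warnings.
int commaYieldsRight(int a, int b) {
    return (static_cast<void>(a), b);
}

// Reads table[j]: the subscript expression (i, j) collapses to j, so the
// first "index" has no effect on which element is read.
int subscriptWithComma(const int* table, int i, int j) {
    assert(table != nullptr);
    return table[(static_cast<void>(i), j)];
}
```

In other words, a two-index subscript written with a comma silently becomes a one-index subscript, which is why a search for existing uses matters before the syntax can be given a new meaning.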

- Re:Please check for this: comma in brackets in C++ (Score:3, Insightful)
  
  by Vorondil28 ( 864578 ) writes:
  
  I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++.
  
  I'm no C++ expert, but isn't int array[row][col] a multidimensional array?
  - Re:Please check for this: comma in brackets in C++ (Score:2)
    
    by Animats ( 122034 ) writes:
    
    I'm no C++ expert, but isn't int array[row][col] a multidimensional array?
    No, it's an array of pointers to an array of elements, which is not quite the same thing.
    Arrays with multiple subscripts have many uses. Sparse array implementations, for example. People implement this now with code that looks like
    tab(i,j) = 1;
    This is valid C++, and with the right overloads, it compiles and runs, but it looks weird.
    - Re:Please check for this: comma in brackets in C++ (Score:3, Informative)
      
      by milgr ( 726027 ) writes:
      
      The grandparent got it correct. C does support multidimensional arrays. I suspect that C++ does too.
      To validate, I pulled out my copy of K&R 2nd edition (Actually a copy I once rescued from a trash bin, and my copy is only "Based on Draft-Proposed ANSI C"). In section 5.9 Pointers vs. Multi-dimensional Arrays it points out,
      
      Newcomers to C are sometimes confused about the difference between a two-dimensional array and an array of pointers, such as name in the example above. Given the definitions
      i
      - Re:Please check for this: comma in brackets in C++ (Score:2)
        
        by unsinged int ( 561600 ) writes:
        
        You're correct, but that's not what the original post is saying. The only way to provide a sparse-matrix class in C++ is with member functions. You can't do it by overloading [] to accept two arguments, e.g. array[2,4]. You have to use a member function, making it look like array.get(2,4), or perhaps overloading () for array(2,4). There's no way to write a matrix class that uses square brackets for indexing more than one dimension.
        
        Re:Please check for this: comma in brackets in C++ (Score:4, Interesting)
        
        by The boojum ( 70419 ) writes: on Monday December 05, 2005 @03:19PM (#14187087)
        
        I was just going to point this out. I even hacked up a simple example to show it:
        
        struct location {
            int dimension, coordinates[ 20 ];
            location( int first_coordinate ) : dimension( 1 ) {
                coordinates[ 0 ] = first_coordinate;
            }
            location &operator,( int const right ) {
                coordinates[ dimension++ ] = right;
                return *this;
            }
        };

        struct array {
            int matrix[ 100 ][ 100 ];
            int &operator[]( location const &right ) {
                return matrix[ right.coordinates[ 1 ] ][ right.coordinates[ 0 ] ];
            }
        };

        int main( int argc, char **argv ) {
            array blah;
            blah[ 5, 5 ] = 10;
        }
        
        Proof of concept and it doesn't really do anything, but it compiles just fine. I don't see a problem here. A real implementation would probably do some clever stuff so that the optimizer can optimize away the intermediate data structure.
        
        
        Proposed workaround doesn't work (Score:4, Informative)
        
        by Animats ( 122034 ) writes: on Monday December 05, 2005 @04:31PM (#14187773) Homepage
        
        Yes, that compiles and runs, but it doesn't do what you think it does. Put in some debug print to see what's actually happening, which is this:
        
        "5,5" is evaluated using the built-in definition of ",", returning "5". The no-conversion built-in operator comma has higher priority than the conversion sequence involving a conversion to "location", then the use of the overloaded comma operator. So the built-in comma operator is used. See the discussion in the C++ ARM, section 13.2, "Argument matching": which says "consider an exact match better than any conversion".
        
        "5" is converted to type "location" by the constructor for "location", resulting in a "location" object with "dimension=1" and "coordinates[0]=5".
        
        This "location" object is passed to "operator[]", which then accesses "coordinates[1]", an uninitialized value, which it then uses as a subscript, returning a reference to an arbitrary memory location. So, instead of returning "&blah.matrix[5][5]", it returns "&blah.matrix[???][5]". The example program seems to run in VC++ only because that part of memory happens to be 0 at startup, so this returns "&blah.matrix[0][5]". In other circumstances, it might cause a crash.
        
        "10" is stored into the wrong location of "blah", or outside it, due to the bad reference generated above. This is where the buffer overflow occurs.
        
        You can force the conversion with
        blah[ location(5), 5] = 10;
        but that's not useful except to see what's happening.
        You can't overload the built-in operators for built-in types. So overloading, outside of an object, "operator,(int, int)" won't work either.
        Hence the need for a straightforward solution.
        
  - Re:Please check for this: comma in brackets in C++ (Score:4, Informative)
    
    by chris macura ( 899109 ) writes: on Monday December 05, 2005 @02:29PM (#14186615)
    
    Yes, they are. But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure, to get multi-dimensional arrays you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated), rather than sitting in one contiguous chunk as in C. In other words, you can't do something like this in C++:
    
    class SmartArray {
    public:
        SmartArray(int height, int width);
        int operator[](const int &x, const int &y) const;
        // ...
    };
    ...
    SmartArray a(5, 5);
    a[12, 13];
    
    - Re:Please check for this: comma in brackets in C++ (Score:4, Insightful)
      
      by Old Wolf ( 56093 ) writes: on Monday December 05, 2005 @04:09PM (#14187566)
      
      You can do exactly that -- just write a(12,13) instead of a[12,13].
      This is a great counterexample to the GP. Changing the meaning
      of the comma within square brackets would gain NOTHING and would
      mean every existing compiler is now wrong.
      
      The existing C array type is bad enough as it is, why make it
      even more unwieldy by introducing a new variant? C++ is already
      on the right track: discourage C arrays, and encourage container
      classes that have things like bounds checking and automatic
      memory allocation.
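The `a(12,13)` approach mentioned in this reply might look like the following. It is a minimal sketch, not anyone's actual implementation: the class reuses the `SmartArray` name from the parent comment, while the member names and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical two-dimensional array stored as one contiguous, row-major
// block and indexed with operator(), since operator[] takes one argument.
class SmartArray {
public:
    SmartArray(std::size_t height, std::size_t width)
        : height_(height), width_(width), cells_(height * width, 0) {}

    // Checked element access: a(row, col).
    int& operator()(std::size_t row, std::size_t col) {
        assert(row < height_ && col < width_);
        return cells_[row * width_ + col];
    }

    int operator()(std::size_t row, std::size_t col) const {
        assert(row < height_ && col < width_);
        return cells_[row * width_ + col];
    }

    std::size_t height() const { return height_; }
    std::size_t width() const { return width_; }

private:
    std::size_t height_;
    std::size_t width_;
    std::vector<int> cells_;  // single allocation, no per-row pointers
};
```

Usage is `SmartArray a(20, 20); a(12, 13) = 7;`. The `assert` checks disappear when `NDEBUG` is defined, so a container meant for production use might throw an exception on a bad index instead.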
      
- Re:Please check for this: comma in brackets in C++ (Score:3, Funny)
  
  by hikerhat ( 678157 ) writes:
  
  Well, the obscureness of the comma operator is used by C++ recruiters who think they are really "clever", and in "clever" C/C++ puzzles on usenet. If you took it away, how would you hire C++ programmers and how would you have fun on usenet?
  Also, C++ programmers are getting really old, and they don't handle change very well.
best_idea_ever (Score:4, Insightful)

by l33t-gu3lph1t3 ( 567059 ) writes: <arch_angel16.hotmail@com> on Monday December 05, 2005 @01:58PM (#14186359) Homepage

charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)

Search for this bug (Score:2)

by ibpooks ( 127372 ) writes:

if( something = something ) ...
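One rough way to search for this, sketched under heavy assumptions: a regular expression that flags an `if` whose condition starts with a plain identifier followed by a single `=`. The function name countAssignmentsInIf is made up for illustration. The pattern only catches the simplest form, ignores deliberate idioms such as `if ((p = f()) != NULL)`, and will also match inside comments and string literals.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts occurrences of `if (identifier = ...` in raw source text, where the
// `=` is not the start of `==`. Hypothetical helper; a text heuristic only.
std::size_t countAssignmentsInIf(const std::string& source) {
    static const std::regex pattern(R"(\bif\s*\(\s*[A-Za-z_]\w*\s*=[^=])");
    const std::sregex_iterator first(source.begin(), source.end(), pattern);
    const std::sregex_iterator last;  // default-constructed end-of-sequence iterator
    return static_cast<std::size_t>(std::distance(first, last));
}
```

A syntax-aware index like the one in the article could do better by looking at the parsed condition instead of raw text.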
See also: Codase.com (Score:3, Informative)

by kriegsman ( 55737 ) writes: on Monday December 05, 2005 @02:00PM (#14186371) Homepage

See also Codase.com [codase.com], another "Source Code Search Engine", which lets you search by method names, class names, variable names, free text, etc.

-Mark

Koders.com (Score:3, Informative)

by knipknap ( 769880 ) writes: on Monday December 05, 2005 @02:01PM (#14186376) Homepage

Don't know, koders.com [koders.com] supports a lot more languages and also lets you narrow your search to specific licenses. The few extra lines of code just don't seem to do it, especially because such measures highly depend on the chosen method.

grep++ (Score:2)

by Doc Ruby ( 173196 ) writes:

I'm surprised that Perl's CPAN archive [cpan.org] doesn't have structured searching at smaller granularity than module name or freeform metadata. Maybe once the archives let us find code by content, we'll get version control databases that store each line in a record, each block as references in a separate table, maybe even referential integrity of variables as foreign keys. I'd love my editor to pull code from DB storage, padding whitespace only in the presentation layer per my preferences.

I'd really love to see data
How about a potential buffer overflow index? (Score:5, Informative)

by raddan ( 519638 ) writes: on Monday December 05, 2005 @02:07PM (#14186429)

You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.
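A rough sketch of how such a count could be taken over raw source text, assuming a hypothetical helper named countRiskyCalls and covering only the three functions named above. It matches call-like text anywhere, including comments and string literals, so treat the numbers as an upper bound, not a list of confirmed overflows.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Counts call-like occurrences of gets(, strcpy( and strcat( in source text.
// The \b keeps fgets( from being counted as gets(.
std::map<std::string, std::size_t> countRiskyCalls(const std::string& source) {
    static const std::regex call(R"(\b(gets|strcpy|strcat)\s*\()");
    std::map<std::string, std::size_t> counts;
    for (std::sregex_iterator it(source.begin(), source.end(), call), end;
         it != end; ++it) {
        ++counts[(*it)[1].str()];  // capture group 1 is the function name
    }
    return counts;
}
```

Dividing each count by the project's line count would give a crude per-project index for comparison.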

stats we'd like to see... (Score:5, Funny)

by digitaldc ( 879047 ) * writes: on Monday December 05, 2005 @02:08PM (#14186438)

-# of non-numerical constants
-# of ( ),{ },\ /,#,; characters in code
-time spent debugging/compiling
-total hours spent in production
-gallons of coffee consumed
-hours of daylight seen
-# of relationships destroyed

Code Styles (Score:5, Interesting)

by ionrock ( 516345 ) writes: on Monday December 05, 2005 @02:09PM (#14186443)

I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
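As a starting point for the simplest of these measurements, here is a small sketch that sorts an identifier into camelCase, under_scores, or neither. The NameStyle enum and classifyIdentifier function are hypothetical names, and the rules are deliberately crude: single-word names such as `count` and all-caps macros such as `MAX_SIZE` land in Other.

```cpp
#include <cctype>
#include <string>

enum class NameStyle { CamelCase, SnakeCase, Other };

// Classifies one identifier by the characters it contains.
NameStyle classifyIdentifier(const std::string& name) {
    bool hasUnderscore = false;
    bool hasUpper = false;
    bool hasLower = false;
    for (unsigned char c : name) {
        if (c == '_') hasUnderscore = true;
        else if (std::isupper(c)) hasUpper = true;
        else if (std::islower(c)) hasLower = true;
    }
    const bool startsLower =
        !name.empty() && std::islower(static_cast<unsigned char>(name[0]));

    // camelCase: starts lowercase, has an uppercase letter, no underscores.
    if (startsLower && hasUpper && !hasUnderscore) return NameStyle::CamelCase;
    // under_scores: has an underscore and lowercase letters, no uppercase.
    if (hasUnderscore && hasLower && !hasUpper) return NameStyle::SnakeCase;
    return NameStyle::Other;
}
```

Tallying the result over every function or variable name in a project would give a per-project style ratio.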

Recycling code (Score:2)

by ZachPruckowski ( 918562 ) writes:

How much of this open-source code DB is reusable? Are most of the lines things that have limited applications, or are most of them more general? I mean, if you have 275 million lines, but 175 million lines are code designed to solve one specific problem and can't be easily cross-applied, then it isn't as useful as the statement implies.

That said, congrats on the milestone, and looking forward to hearing of more!
histogram of C reserved words (Score:5, Interesting)

by jab ( 9153 ) writes: on Monday December 05, 2005 @02:14PM (#14186487) Homepage

I'd love to see how one of my programs (stats below) compares to the, uh, national average.

1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long

- Re:histogram of C reserved words (Score:2)
  
  by maxwell demon ( 590494 ) writes:
  
  35 switch, but 55 default? Do you have switches with more than one default case, or did I miss another use of that keyword?
- Re:histogram of C reserved words (Score:5, Funny)
  
  by plabtfall ( 859254 ) writes: on Monday December 05, 2005 @03:25PM (#14187139)
  
  Yeah, me too: 2431 int 1802 goto
  
- Re:histogram of C reserved words - well, B .... (Score:3, Informative)
  
  by ignavus ( 213578 ) writes:
  
  auto is a throwback to B days (the language immediately before C). B had no data types (no int, float, double, etc) but did have storage types: auto, static, and extrn.
  
  auto was necessary in B for local variables, as a plain variable name by itself was a valid expression statement (as it is in C), not a declaration (IIRC).
  
  1. foo() { auto bar; ... }
  2. foo() { static bar; ... }
  3. foo() { extrn bar; ... }
  4. foo() { bar; ... }
  
  All mean something different in B: the first three instances of bar are declarations, t
Finally! (Score:2)

by Locke2005 ( 849178 ) writes:

Now SCO will finally be able to find all the code that was stolen from them!
Comments (Score:2)

by Daedala ( 819156 ) writes:

I'm dying to know... What percentage of the code is commentary?

And are there any haiku?
the known answer (Score:2)

by __aaitqo8496 ( 231556 ) writes:

select count(*) from sourcecode where comments > 0
0 row(s) returned

plagiarism at its finest [thinkgeek.com]

mod -1 lame
Don't mess around, learn from NLP folks (Score:5, Insightful)

by Xofer D ( 29055 ) writes: on Monday December 05, 2005 @02:37PM (#14186681) Homepage Journal

This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP [wikipedia.org] people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation [wikipedia.org]. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.

If that's too hard, try finding all n-grams [wikipedia.org] instead, at least under some length. That's a lot more useful than just individual tokens or strings.

With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
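For the n-gram suggestion, a minimal sketch might look like the following. It assumes tokens are simply whitespace-separated (a real run would use the lexer from a C++ parser), and the names splitOnWhitespace, Ngram and countNgrams are illustrative only.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Ngram = std::vector<std::string>;

// Naive tokenizer: splits on whitespace. A stand-in for a real C++ lexer.
std::vector<std::string> splitOnWhitespace(const std::string& source) {
    std::istringstream in(source);
    std::vector<std::string> tokens;
    for (std::string token; in >> token;) tokens.push_back(token);
    return tokens;
}

// Counts every run of n consecutive tokens.
std::map<Ngram, std::size_t> countNgrams(const std::vector<std::string>& tokens,
                                         std::size_t n) {
    std::map<Ngram, std::size_t> counts;
    if (n == 0 || n > tokens.size()) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(i);
        const auto last = first + static_cast<std::ptrdiff_t>(n);
        ++counts[Ngram(first, last)];
    }
    return counts;
}
```

Replacing each identifier with a placeholder token before counting would make the n-grams reflect grammar shape instead of naming choices.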

That's easy: search for known security holes (Score:2)

by CodeShark ( 17400 ) writes:

that permit things like buffer overflows, etc.
Though I don't develop much in C++ currently, and haven't had the time to do anything Linux wise in years, I would love to have an identified location for security-bug free algorithms, etc. that I could use if I need to do more C++ work in the future.
There is boost? (Score:2)

by Cyberax ( 705495 ) writes:

This index doesn't even contain Boost (http://www.boost.org/ [boost.org]) and Loki libraries!

It can't be called 'comprehensive' after that...
Cyclomatic complexity... (Score:2)

by xquark ( 649804 ) writes:

would be a nice feature to have, both average and per project/module basis.
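A very rough estimate can be had by counting decision points in a function body and adding one. The sketch below assumes a hypothetical estimateCyclomaticComplexity function and a chosen set of branch markers (if, for, while, case, catch, &&, ||, ?). It works on raw text, so branches inside comments, strings, or macros are miscounted; a proper figure needs the control-flow graph from a parser.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Estimates cyclomatic complexity as (number of decision points) + 1.
std::size_t estimateCyclomaticComplexity(const std::string& body) {
    static const std::set<std::string> branchKeywords = {
        "if", "for", "while", "case", "catch"};

    std::size_t decisions = 0;
    std::string word;
    // Iterate one position past the end so the final word is flushed.
    for (std::size_t i = 0; i <= body.size(); ++i) {
        const unsigned char c =
            i < body.size() ? static_cast<unsigned char>(body[i]) : ' ';
        if (std::isalnum(c) || c == '_') {
            word += static_cast<char>(c);
            continue;
        }
        if (branchKeywords.count(word)) ++decisions;
        word.clear();

        if (c == '?') ++decisions;  // conditional operator
        if ((c == '&' || c == '|') && i + 1 < body.size() &&
            body[i + 1] == static_cast<char>(c)) {
            ++decisions;  // && or ||
            ++i;          // skip the second character of the operator
        }
    }
    return decisions + 1;
}
```

Averaging this value over all functions in a project or module would give the per-project figure asked for here.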
TODOs (Score:2, Interesting)

by mrshoe ( 697123 ) writes:

Counting the number of "TODO"s and "XXX"s in "production" open source code could be interesting.
- Re:And then... (Score:5, Interesting)
  
  by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @01:39PM (#14186176) Homepage
  
  Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.
  
  - Re:And then... (Score:3, Funny)
    
    by guaigean ( 867316 ) writes:
    
    My apologies then. As a regular Slashdotter it is forbidden for me to RTFA.
- Re:Are you proud of 275 million lines of code? (Score:2)
  
  by sparkes ( 125299 ) writes:
  
  Write 'I must read at least the post before I comment' 275 million times and when you are finished you can use slashdot again.
- Re:What? Millions of code? (Score:5, Informative)
  
  by tgd ( 2822 ) writes: on Monday December 05, 2005 @01:49PM (#14186276)
  
  It's a searchable database OF code from other products, containing 275 million lines you can search across.
  
  It's not a searchable database written in 275 million lines of code.
  
- Re:What? Millions of code? (Score:2)
  
  by masklinn ( 823351 ) writes:
  
  Whoa, not only you didn't RTFA (well, that's slashdot so it's ok) but you didn't even read the headline?
- Re:What? Millions of code? (Score:2)
  
  by Shimmer ( 3036 ) writes:
  
  This project didn't write the 275 million lines of code, they collected code written by others.
  - Re:What? Millions of code? (Score:2)
    
    by Quiet_Desperation ( 858215 ) writes:
    
    That's why I moved on to higher-level stuff like VB, or RealBasic now that VB has been sucked into the .Net singularity. I don't write 3D games or supercomputer simulations of galactic collisions. Most of what I write is toolware or interfaces to my own hardware designs - very GUI-oriented stuff that needs to go from idea to working application in, like, one day. But I still get the "serious coders" asking "why aren't you doing that in C?" Or the message board trolls with "Dur! You couldn't write a FPS in RB!
- Re:275+ million lines (Score:2)
  
  by hritcu ( 871613 ) writes:
  
  line 53256 "Hey pretty lady, are you an astronaut because your ass looks out of this world"
  
  Knowing that there are not so many women writing (or *sigh* reading) open source, I think it is very unlikely that adding such a line to your source code will get you anywhere. You could try, though, and of course tell us what happened :)
- Re:275+ million lines (Score:3, Funny)
  
  by gstoddart ( 321705 ) writes:
  
  How about the % of them that would work on a lady in a bar? line 53256 "Hey pretty lady, are you an astronaut because your ass looks out of this world" ....oh....not those kinds of lines....*sigh* and I thought I was so close
  No, no, no.
  
  You do not use lines 1..N on the same lady until it works. It's not like breaking encryption -- you don't get to try all the possible keys.
  
  I have friends who have done this, and they swear it's a percentage game. Choose one line you like, and try it on women 1..N until it
- Re:Wtf? (Score:3, Insightful)
  
  by Digital Vomit ( 891734 ) writes:
  
  What better reason to create such a program than "why not"?
  A person who is a true programmer in his soul doesn't ask himself "why". Oftentimes the sheer joy of creating something from nothing is enough.
- Re:the obvious answer (Score:2)
  
  by iapetus ( 24050 ) writes:
  
  275 million, but I'm not telling you which ones.
- Re:Size doesn't matter (Score:3, Funny)
  
  by kmartshopper ( 836454 ) writes:
  
  It's the quality of the search results that counts.
  
  Yeah, keep telling yourself that...
- Re:Evolution data server and courier imap (Score:2)
  
  by kalislashdot ( 229144 ) writes:
  
  I love Courier, what else is there? UW? The maildir format is pretty awesome.
- ...barely the lobby. (Score:2)
  
  by C10H14N2 ( 640033 ) writes:
  
  Consider, a page is 45 lines, an average book is 350 pages @ about 2" thick (ergo, about 15-16k books), a stack is roughly 12' wide by 6 shelves, double sided (864 books) and a row is about six stacks long (72' / 5,184 books). So, in a compactus, about 432sq/ft, to the 2,100,000sq/ft of the Madison building alone. The total linear capacity is 540 miles. Using the above assumptions, that's about 205 million books, so if printed, this repository would take up roughly 1/13,000th of the space. Imagine if your h

